Abstract. This paper provides inference methods for best linear approximations to functions
which are known to lie within a band. It extends the partial identification literature
by allowing the upper and lower functions defining the band to be any functions, including 
ones carrying an index, which can be estimated parametrically or non-parametrically. The 
identification region of the parameters of the best linear approximation is characterized 
via its support function, and limit theory is developed for the latter. We prove that the 
support function approximately converges to a Gaussian process and establish validity of 
the Bayesian bootstrap. The paper nests as special cases the canonical examples in the 
literature: mean regression with interval valued outcome data and interval valued regres- 
sor data. Because the bounds may carry an index, the paper covers problems beyond 
mean regression; the framework is extremely versatile. Applications include quantile and 
distribution regression with interval valued data, sample selection problems, as well as 
mean, quantile, and distribution treatment effects. Moreover, the framework can account 
for the availability of instruments. An application is carried out, studying female labor 
force participation along the lines of Mulligan and Rubinstein (2008). 
1. Introduction

This paper contributes to the literature on estimation and inference for best linear
approximations to set identified functions. Specifically, we work with a family of functions
$f(x,\alpha)$ indexed by some parameter $\alpha \in \mathcal{A}$, that is known to satisfy
$\theta_0(x,\alpha) \le f(x,\alpha) \le \theta_1(x,\alpha)$ $x$-a.s., with $x \in \mathbb{R}^d$ a vector of regressors.
Econometric frameworks yielding such restrictions are ubiquitous in economics and in the
social sciences, as illustrated by Manski (2003, 2007). Cases explicitly analyzed in this paper
include: (1) mean regression; (2) quantile regression; and (3) distribution and duration
regression, in the presence of interval valued data, including hazard models with
interval-valued failure times; (4) sample selection problems; (5) mean treatment effects;
(6) quantile treatment effects; and (7) distribution treatment effects; see Section 3 for
details.$^{1}$ Yet, the methodology that we propose can be
applied to virtually any of the frameworks discussed in Manski (2003, 2007). In fact, our
results below also allow for exclusion restrictions that yield intersection bounds of the form
$\sup_{v \in \mathcal{V}} \theta_0(x,v,\alpha) = \theta_0(x,\alpha) \le f(x,\alpha) \le \theta_1(x,\alpha) = \inf_{v \in \mathcal{V}} \theta_1(x,v,\alpha)$ $x$-a.s., with $v$ an
instrumental variable taking values in a set $\mathcal{V}$. The bounding functions $\theta_0(x,\alpha)$ and $\theta_1(x,\alpha)$
may be indexed by a parameter $\alpha \in \mathcal{A}$ and may be any estimable function of $x$.

Our method appears to be the first and currently only method available in the litera- 
ture for performing inference on best linear approximations to set identified functions when 
the bounding functions $\theta_0(x,\alpha)$ and $\theta_1(x,\alpha)$ need to be estimated. Moreover, we allow
for the functions to be estimated both parametrically as well as non-parametrically via se- 
ries estimators. Previous closely related contributions by Beresteanu and Molinari (2008) 
and Bontemps, Magnac, and Maurin (2010) provided inference methods for best linear ap- 
proximations to conditional expectations in the presence of interval outcome data. In that 
environment, the bounding functions do not need to be estimated, as the set of best linear 
approximations can be characterized directly through functions of moments of the observ- 
able variables. Hence, our paper builds upon and significantly generalizes their results. 
These generalizations are our main contribution and are imperative for many empirically 
relevant applications. 

Our interest in best linear approximations is motivated by the fact that when the restriction
$\theta_0(x,\alpha) \le f(x,\alpha) \le \theta_1(x,\alpha)$ $x$-a.s. summarizes all the information available to the
researcher, the sharp identification region for $f(\cdot,\alpha)$ is given by the set of functions

$$\mathcal{F}(\alpha) = \{\phi(\cdot,\alpha) : \theta_0(x,\alpha) \le \phi(x,\alpha) \le \theta_1(x,\alpha)\ x\text{-a.s.}\}.$$

The set $\mathcal{F}(\alpha)$, while sharp, can be difficult to interpret and report, especially when $x$ is
multi-dimensional. Similar considerations apply to related sets, such as for example the set
of marginal effects of components of $x$ on $f(x,\alpha)$. Consequently, in this paper we focus
on the sharp set of parameters characterizing best linear approximations to the functions
comprising $\mathcal{F}(\alpha)$. This set is of great interest in empirical work because of its tractability.
Moreover, when the set identified function is a conditional expectation, the corresponding
set of best linear approximations is robust to model misspecification (Ponomareva and
Tamer (2010)).

1 For example, one may be interested in the $\alpha$-conditional quantile of a random variable $y$ given $x$,
denoted $Q_y(\alpha|x)$, but only observe interval data $[y_0, y_1]$ which contain $y$ with probability one. In this case,
$f(x,\alpha) = Q_y(\alpha|x)$ and $\theta_\ell(x,\alpha) = Q_\ell(\alpha|x)$, $\ell = 0, 1$, the conditional quantiles of properly specified random
variables.

In practice, we propose to estimate the sharp set of parameter vectors, denoted $B(\alpha)$,
corresponding to the set of best linear approximations. Simple linear transformations applied
to $B(\alpha)$ yield the set of best linear approximations to $f(x,\alpha)$, the set of linear
combinations of components of $b \in B(\alpha)$, bounds on each single coefficient, etc. The set $B(\alpha)$ is
especially tractable because it is a transformation, through linear operators, of the random
interval $[\theta_0(x,\alpha), \theta_1(x,\alpha)]$, and therefore is convex. Hence, inference on $B(\alpha)$ can be
carried out using its support function $\sigma(q,\alpha) = \sup_{b \in B(\alpha)} q'b$, where $q \in \mathcal{S}^{d-1}$ is a direction in
the unit sphere in $d$ dimensions.$^{2}$ Beresteanu and Molinari (2008) and Bontemps, Magnac,
and Maurin (2010) previously proposed the use of the support function as a key tool to
conduct inference in best linear approximations to conditional expectation functions. An
application of their results gives that the support function of $B(\alpha)$ is equal to the
expectation of a function of $(\theta_0(x,\alpha), \theta_1(x,\alpha), x, E(xx'))$. Hence, an application of the analogy
principle suggests estimating $\sigma(q,\alpha)$ through a sample average of the same function, where
$\theta_0(x,\alpha)$ and $\theta_1(x,\alpha)$ are replaced by parametric or non-parametric estimators, and $E(xx')$
is replaced by its sample analog. The resulting estimator, denoted $\hat\sigma(q,\alpha)$, yields an
estimator for $B(\alpha)$ through the characterization in equation (2.2) below. We show that $\hat\sigma(q,\alpha)$
is a consistent estimator for $\sigma(q,\alpha)$, uniformly over $(q,\alpha) \in \mathcal{S}^{d-1} \times \mathcal{A}$. We then establish
the approximate asymptotic Gaussianity of our set estimator. Specifically, we prove that
when properly recentered and normalized, $\hat\sigma(q,\alpha)$ approximately converges to a Gaussian
process on $\mathcal{S}^{d-1} \times \mathcal{A}$ (we explain below what we mean by "approximately"). The covariance
function of this process is quite complicated, so we propose the use of a Bayesian bootstrap
procedure for practical inference, and we prove consistency of this bootstrap procedure.

Because the support function process converges on $\mathcal{S}^{d-1} \times \mathcal{A}$, our asymptotic results
also allow us to perform inference on statistics that involve a continuum of values for $q$
and/or $\alpha$. For example, for best linear approximations to conditional quantile functions in
the presence of interval outcome data, we are able to test whether a given regressor $x_j$ has
a positive effect on the conditional $\alpha$-quantile for any $\alpha \in \mathcal{A}$.

In providing a methodology for inference, our paper overcomes significant technical com- 
plications, thereby making contributions of independent interest. First, we allow for the 
possibility that some of the regressors in $x$ have a discrete distribution. In order to conduct
tests of hypotheses and make confidence statements, both Beresteanu and Molinari (2008)
and Bontemps, Magnac, and Maurin (2010) had explicitly ruled out discrete regressors, as 
their presence greatly complicates the derivation of the limiting distribution of the support 
function process. By using a simple data-jittering technique, we show that these compli- 
cations completely disappear, albeit at the cost of basing statistical inference on a slightly 
conservative confidence set. 



2 "The support function (of a nonempty closed convex set $B$ in direction $q$) $\sigma_B(q)$ is the signed distance
of the support plane to B with exterior normal vector q from the origin; the distance is negative if and only 
if q points into the open half space containing the origin," Schneider (1993, page 37). See Rockafellar (1970, 
Chapter 13) or Schneider (1993, Section 1.7) for a thorough discussion of the support function of a closed 
convex set and its properties. 
Second, when $\theta_0(x,\alpha)$ and $\theta_1(x,\alpha)$ are non-parametrically estimated through series estimators,
we show that the support function process is approximated by a Gaussian process
that may not necessarily converge as the number of series functions increases to infinity. 
To solve this difficulty, we use a strong approximation argument and show that each subse- 
quence has a further subsequence converging to a tight Gaussian process with a uniformly 
equicontinuous and non-degenerate covariance function. We can then conduct inference 
using the properties of the covariance function. 

To illustrate the use of our estimator, we revisit the analysis of Mulligan and Rubinstein 
(2008). The literature studying female labor force participation has argued that the gender 
wage gap has shrunk between 1975 and 2001. Mulligan and Rubinstein (2008) suggest 
that women's wages may have grown less than men's wages between 1975 and 2001, had 
their behavior been held constant, but a selection effect induces the data to show the gender 
wage gap contracting. They point out that a growing wage inequality within gender induces 
women to invest more in their market productivity. In turn, this would differentially pull 
high skilled women into the workplace and the selection effect may make it appear as if 
cross-gender wage inequality had declined. 

To test this conjecture they employ a Heckman selection model to correct married 
women's conditional mean wages for selectivity and investment biases. Using CPS repeated 
cross-sections from 1975-2001 they argue that the selection of women into the labor market 
has changed sign, from negative to positive, or at least that positive selectivity bias has come 
to overwhelm investment bias. Specifically, they find that the gender wage gap measured 
by OLS decreased from -0.419 in 1975-1979 to -0.256 in 1995-1999. After correcting for 
selection using the classic Heckman selection model, they find that the wage gap was -0.379 
in 1975-1979 and -0.358 in 1995-1999, thereby concluding that correcting for selection, the 
gender wage gap may have not shrunk at all. Because it is well known that without a strong 
exclusion restriction results of the normal selection model can be unreliable, Mulligan and 
Rubinstein conduct a sensitivity analysis which corroborates their findings. 

We provide an alternative approach. We use our method to estimate bounds on the 
quantile gender wage gap without assuming a parametric form of selection or a strong 
exclusion restriction. We are unable to reject that the gender wage gap declined over the 
period in question. This suggests that the instruments may not be sufficiently strong to 
yield tight bounds and that there may not be enough information in the data to conclude 
that the gender gap has or has not declined from 1975 to 1999 without strong functional 
form assumptions. 

Related Literature. This paper contributes to a growing literature on inference on set- 
identified parameters. Important examples in the literature include Andrews and Jia (2008), 
Andrews and Shi (2009), Andrews and Soares (2010), Beresteanu and Molinari (2008), Bon- 
temps, Magnac, and Maurin (2010), Bugni (2010), Canay (2010), Chernozhukov, Hong, and 
Tamer (2007), Chernozhukov, Lee, and Rosen (2009), Galichon and Henry (2009), Kaido 
(2010), Romano and Shaikh (2008), Romano and Shaikh (2010), and Rosen (2008), among 
others. Beresteanu and Molinari (2008) propose an approach for estimation and inference 
for a class of partially identified models with convex identification region based on results 
from random set theory. Specifically, they consider models where the identification region is 
equal to the Aumann expectation of a properly defined random set that can be constructed
from observable random variables. Extending the analogy principle, Beresteanu and Moli- 
nari suggest to estimate the Aumann expectation using a Minkowski average of random sets. 
Building on the fundamental insight in random set theory that convex compact sets can be 
represented via their support functions (see, e.g., Artstein and Vitale (1975)), Beresteanu 
and Molinari accordingly derive asymptotic properties of set estimators using limit the- 
ory for stochastic processes. Bontemps, Magnac, and Maurin (2010) extend the results of 
Beresteanu and Molinari (2008) in important directions, by allowing for incomplete linear 
moment restrictions where the number of restrictions exceeds the number of parameters to 
be estimated, and extend the familiar Sargan test for overidentifying restrictions to par- 
tially identified models. Kaido (2010) establishes a duality between the criterion function 
approach proposed by Chernozhukov, Hong, and Tamer (2007), and the support function 
approach proposed by Beresteanu and Molinari (2008). 

Concurrently and independently of our work, Kline and Santos (2010) study the sensitiv- 
ity of empirical conclusions about conditional quantile functions to the presence of missing 
outcome data, when the Kolmogorov-Smirnov distance between the conditional distribution 
of observed outcomes and the conditional distribution of missing outcomes is bounded by 
some constant k across all values of the covariates. Under these assumptions, Kline and 
Santos show that the conditional quantile function is sandwiched between a lower and an 
upper band, indexed by the level of the quantile and the constant k. To conduct inference, 
they assume that the support of the covariates is finite, so that the lower and upper bands 
can be estimated at parametric rates. Kline and Santos' framework is a special case of our 
sample selection example listed above. Hence, our results significantly extend the scope of 
Kline and Santos' analysis, by allowing for continuous regressors. Moreover, the method 
proposed in this paper allows for the upper and lower bounds to be non-parametrically 
estimated by series estimators, and allows the researcher to utilize instruments. While 
technically challenging, allowing for non-parametric estimates of the bounding functions 
and for intersection bounds considerably expands the domain of applicability of our results. 

Structure of the Paper. This paper is organized as follows. In Section 2 we develop
our framework, and in Section 3 we demonstrate its versatility by applying it to quantile
regression, distribution regression, sample selection problems, and treatment effects. Section
4 provides an overview of our theoretical results and describes the estimation and inference
procedures. Section 5 gives the empirical example. Section 6 concludes. All proofs are in
the Appendix.

2. The General Framework 

We aim at conducting inference for best linear approximations to the set of functions

$$\mathcal{F}(\alpha) = \{\phi(\cdot,\alpha) : \theta_0(x,\alpha) \le \phi(x,\alpha) \le \theta_1(x,\alpha)\ x\text{-a.s.}\}.$$

Here, $\alpha \in \mathcal{A}$ is some index with $\mathcal{A}$ a compact set, and $x$ is a column vector in $\mathbb{R}^d$. For
example, in quantile regression $\alpha$ denotes a quantile; in duration regression $\alpha$ denotes a
failure time. For each $x$, $\theta_0(x,\alpha)$ and $\theta_1(x,\alpha)$ are point-identified lower and upper bounds
on a true but non-point-identified function of interest $f(x,\alpha)$. If $f(x,\alpha)$ were point identified,
we could approximate it with a linear function by choosing coefficients $\beta(\alpha)$ to minimize
the expected squared prediction error $E[(f(x,\alpha) - x'\beta(\alpha))^2]$. Because $f(x,\alpha)$ is only known
to lie in $\mathcal{F}(\alpha)$, performing this operation for each admissible function $\phi(\cdot,\alpha) \in \mathcal{F}(\alpha)$ yields
a set of observationally equivalent parameter vectors, denoted $B(\alpha)$:

$$\begin{aligned}
B(\alpha) &= \{\beta \in \arg\min_{b} E[(\phi(x,\alpha) - x'b)^2] : P(\theta_0(x,\alpha) \le \phi(x,\alpha) \le \theta_1(x,\alpha)) = 1\} \\
&= \{\beta = E[xx']^{-1} E[x\phi(x,\alpha)] : P(\theta_0(x,\alpha) \le \phi(x,\alpha) \le \theta_1(x,\alpha)) = 1\}.
\end{aligned} \tag{2.1}$$

It is easy to see that the set $B(\alpha)$ is almost surely non-empty, compact, and convex valued,
because it is obtained by applying linear operators to the (random) almost surely non-empty
interval $[\theta_0(x,\alpha), \theta_1(x,\alpha)]$; see Beresteanu and Molinari (2008, Section 4) for a discussion.
Hence, $B(\alpha)$ can be characterized quite easily through its support function

$$\sigma(q,\alpha) := \sup_{\beta(\alpha) \in B(\alpha)} q'\beta(\alpha),$$

which takes on almost surely finite values $\forall q \in \mathcal{S}^{d-1} := \{q \in \mathbb{R}^d : \|q\| = 1\}$, $d = \dim \beta$. In
fact,

$$B(\alpha) = \bigcap_{q \in \mathcal{S}^{d-1}} \{b : q'b \le \sigma(q,\alpha)\}, \tag{2.2}$$

see Rockafellar (1970, Chapter 13). Note also that $[-\sigma(-q,\alpha), \sigma(q,\alpha)]$ gives sharp bounds
on the linear combination of $\beta(\alpha)$'s components obtained using weighting vector $q$.
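
To make the half-space characterization (2.2) concrete, the following sketch uses a set whose support function is known exactly: the unit square in two dimensions, standing in for the identified set. The helper names `support` and `in_set`, the square itself, and the grid of 360 directions are assumptions made purely for illustration. With finitely many directions the intersection of half-spaces is only an outer approximation of the set.

```python
import numpy as np

def support(q, vertices):
    """Support function of the convex hull of `vertices` in direction q."""
    return float(np.max(vertices @ q))

def in_set(b, vertices, n_dirs=360):
    """Approximate membership test via (2.2): b is in the set iff q'b <= sigma(q) for all q."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    # Small tolerance guards against floating point noise on the boundary.
    return all(q @ b <= support(q, vertices) + 1e-12 for q in dirs)

# Hypothetical convex set: the unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

# Sharp bounds on the first coordinate: [-sigma(-e1), sigma(e1)].
e1 = np.array([1.0, 0.0])
bounds_first_coord = (-support(-e1, square), support(e1, square))
```

Projecting on the first coordinate axis recovers the interval from 0 to 1, which mirrors how the pair of support function values in directions $-q$ and $q$ delivers sharp bounds on a linear combination of coefficients.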

More generally, if the criterion for "best" linear approximation is to minimize
$E[(f(x,\alpha) - x'\beta(\alpha))\, z'Wz\, (f(x,\alpha) - x'\beta(\alpha))]$, where $W$ is a $j \times j$ weight matrix and $z$ a $j \times 1$ vector of
instruments, then we have

$$B(\alpha) = \{\beta = E[xz'Wzx']^{-1} E[xz'Wz\,\phi(x,\alpha)] : P(\theta_0(x,\alpha) \le \phi(x,\alpha) \le \theta_1(x,\alpha)) = 1\}.$$

As in Bontemps, Magnac, and Maurin (2010), Magnac and Maurin (2008), and Beresteanu
and Molinari (2008, p. 807) the support function of $B(\alpha)$ can be shown to be

$$\sigma(q,\alpha) = E[z_q w_q],$$

where

$$z = xz'Wz, \qquad z_q = q'E[xz']^{-1}z,$$
$$w_q = \theta_1(x,\alpha)1(z_q > 0) + \theta_0(x,\alpha)1(z_q < 0).$$

We estimate the support function by plugging in estimates of $\theta_\ell$, $\ell = 0, 1$, and taking
empirical expectations:

$$\hat\sigma(q,\alpha) = E_n\left[q'(E_n[x_i z_i'])^{-1} z_i \left(\hat\theta_1(x_i,\alpha)1(\hat z_{iq} > 0) + \hat\theta_0(x_i,\alpha)1(\hat z_{iq} < 0)\right)\right],$$

where $E_n$ denotes the empirical expectation, $\hat z_{iq} = q'(E_n[x_i z_i'])^{-1} z_i$, and $\hat\theta_\ell(x,\alpha)$ are the
estimators of $\theta_\ell(x,\alpha)$, $\ell = 0, 1$.
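
As a rough numerical illustration of the plug-in estimator, the sketch below treats the simplest case, assuming $z = x$ so that the criterion reduces to ordinary least squares. The function name `support_function` and the simulated bracketed outcome are hypothetical. The bounding functions are replaced by the interval endpoints themselves, which is valid for mean regression with interval outcomes because $z_q$ depends on $x$ only; in other applications a first-stage estimate of the bounding functions would be supplied instead.

```python
import numpy as np

def support_function(q, x, theta0, theta1):
    """Plug-in estimate of sigma(q) = E[z_q * w_q] for the case z = x.

    x      : (n, d) regressor matrix (include a column of ones for an intercept)
    theta0 : (n,) lower bounding function evaluated at each x_i
    theta1 : (n,) upper bounding function evaluated at each x_i
    """
    sxx = x.T @ x / len(x)                    # E_n[x_i x_i']
    z_q = x @ np.linalg.solve(sxx, q)         # q' E_n[x x']^{-1} x_i (sxx is symmetric)
    w_q = np.where(z_q > 0, theta1, theta0)   # upper bound where z_q > 0, lower bound otherwise
    return float(np.mean(z_q * w_q))

# Hypothetical interval-outcome data: y is only known to lie in [y0, y1].
rng = np.random.default_rng(0)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * x[:, 1] + rng.normal(size=n)  # latent outcome, unobserved in practice
y0, y1 = np.floor(y), np.floor(y) + 1.0       # observed bracket

# Bounds on the slope coefficient: [-sigma(-e), sigma(e)] with e = (0, 1)'.
e_slope = np.array([0.0, 1.0])
lower = -support_function(-e_slope, x, y0, y1)
upper = support_function(e_slope, x, y0, y1)
```

By construction the interval `[lower, upper]` contains the least squares slope that would be obtained if the latent outcome were observed.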
3. Motivating Examples

3.1. Interval valued data. Analysis of regression with interval valued data has become 
a canonical example in the partial identification literature. Interest in this example stems 
from the fact that dealing with interval data is a commonplace problem in empirical work. 
Due to concerns for privacy, survey data often come in bracketed form. For example, public 
use tax data are recorded as the number of taxpayers who belong to each of a finite
number of cells, as seen in Piketty (2005). The Health and Retirement Study provides
a finite number of income brackets to each of its respondents, with degenerate brackets 
for individuals who opt to fully reveal their income level; see Juster and Suzman (1995) 
for a description. The Occupational Employment Statistics (OES) program at the Bureau 
of Labor Statistics collects wage data from employers as intervals, and uses these data to 
construct estimates for wage and salary workers in 22 major occupational groups and 801 
detailed occupations.$^{3}$

3.1.1. Interval valued $y$. Beresteanu and Molinari (2008) and Bontemps, Magnac, and Maurin
(2010), among others, have focused on estimation of best linear approximations to conditional
expectation functions with interval valued outcome data. Our framework covers the
conditional expectation case, as well as an extension to quantile regression wherein we set
identify $\beta(\alpha)$ across all quantiles $\alpha \in \mathcal{A}$. To avoid redundancy with the related literature,
here we describe the setup for quantile regression. Let the $\alpha$-th conditional quantile of $y|x$
be denoted $Q_y(\alpha|x)$. We are interested in a linear approximation $x'\beta(\alpha)$ to this function.
However, we do not observe $y$. Instead we observe $y_0$ and $y_1$, with $P(y_0 \le y \le y_1) = 1$. It
is immediate that

$$Q_{y_0}(\alpha|x) \le Q_y(\alpha|x) \le Q_{y_1}(\alpha|x) \quad x\text{-a.s.},$$

where $Q_{y_\ell}(\alpha|x)$ is the $\alpha$-th conditional quantile of $y_\ell|x$, $\ell = 0, 1$. Hence, the identification
region $B(\alpha)$ is as in equation (2.1), with $\theta_\ell(\alpha,x) = Q_{y_\ell}(\alpha|x)$.
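
The bounding functions in this example can be sketched numerically under the simplifying assumption that the regressor is discrete, so that conditional quantiles are computed cell by cell. The function name `quantile_bounds` and the simulated bracketed outcome are hypothetical; with continuous regressors a flexible estimator such as series quantile regression would take the place of the cell quantiles.

```python
import numpy as np

def quantile_bounds(alpha, x_cell, y0, y1):
    """Cell-by-cell empirical alpha-quantiles of y0 and y1 for a discrete regressor."""
    cells = np.unique(x_cell)
    lo = {int(c): float(np.quantile(y0[x_cell == c], alpha)) for c in cells}
    hi = {int(c): float(np.quantile(y1[x_cell == c], alpha)) for c in cells}
    return lo, hi

# Hypothetical data: the outcome is reported only in unit-width brackets.
rng = np.random.default_rng(1)
n = 3000
x_cell = rng.integers(0, 3, size=n)            # discrete regressor with three cells
y = x_cell + rng.normal(size=n)                # latent outcome, unobserved in practice
y0, y1 = np.floor(y), np.floor(y) + 1.0        # observed bracket containing y

theta0, theta1 = quantile_bounds(0.5, x_cell, y0, y1)
```

Because the brackets contain the latent outcome observation by observation, the cell quantiles of `y0` and `y1` bracket the cell quantile of the latent outcome at every level.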

3.1.2. Interval valued $x_i$. Suppose now that one is interested in the conditional expectation
$E(y|x)$, but only observes $y$ and variables $x_0, x_1$ such that $P(x_0 \le x \le x_1) = 1$. This may
occur, for example, when education data is binned into categories such as primary school,
secondary school, college, and graduate education, while the researcher may be interested
in the return to each year of schooling. It also happens when a researcher is interested in a
model in which wealth is a covariate, but the available survey data report it by intervals.

Our approach applies to the framework of Manski and Tamer (2002) for conditional
expectation with interval regressors, and extends it to the case of quantile regression.$^{4}$
Following Manski and Tamer, we assume that the conditional expectation of $y|x$ is (weakly)
monotonic in $x$, say nondecreasing, and mean independent of $x_0, x_1$ conditional on $x$. Manski
and Tamer show that

$$\sup_{x_1 \le x} E(y|x_0,x_1) \le E(y|x) \le \inf_{x_0 \ge x} E(y|x_0,x_1).$$

Hence, the identification region $B(\alpha)$ is as in equation (2.1), with $\theta_0(\alpha,x) = \sup_{x_1 \le x} E(y|x_0,x_1)$
and $\theta_1(\alpha,x) = \inf_{x_0 \ge x} E(y|x_0,x_1)$.

3 See Manski and Tamer (2002) and Bontemps, Magnac, and Maurin (2010) for more examples.
4 Our methods also apply to the framework of Magnac and Maurin (2008), who study identification in
semi-parametric binary regression models with regressors that are either discrete or measured by intervals.
Compared to Manski and Tamer (2002), Magnac and Maurin's analysis requires an uncorrelated error
assumption, a conditional independence assumption between error and interval/discrete valued regressor,
and a finite support assumption.

Next, suppose that the $\alpha$-th conditional quantile of $y|x$ is monotonic in $x$, say
nondecreasing, and that $Q_y(\alpha|x,x_0,x_1) = Q_y(\alpha|x)$. By the same reasoning as above,

$$\sup_{x_1 \le x} Q_y(\alpha|x_0,x_1) \le Q_y(\alpha|x) \le \inf_{x_0 \ge x} Q_y(\alpha|x_0,x_1).$$

Hence, the identification region $B(\alpha)$ is as in equation (2.1), with $\theta_0(\alpha,x) = \sup_{x_1 \le x} Q_y(\alpha|x_0,x_1)$
and $\theta_1(\alpha,x) = \inf_{x_0 \ge x} Q_y(\alpha|x_0,x_1)$.

3.2. Distribution and duration regression with interval outcome data. Researchers
may also be interested in distribution regression with interval valued data. One instance is
a proportional hazard model with time varying coefficients, where the probability of failure
conditional on survival may depend on covariates and on coefficients that are indexed
by time. More generally, we can consider models in which the conditional distribution of
$y|x$ is given by

$$P(y \le \alpha|x) = F_{y|x}(\alpha|x) = \Phi(f(\alpha,x)),$$

where $\Phi(\cdot)$ is a known one-to-one link function. A special case of this class of models is
the duration model, wherein we have $f(\alpha,x) = g(\alpha) + \gamma(x)$, where $g(\cdot)$ is a monotonic
function.

As in the quantile regression example, assume that we observe $(y_0, y_1, x)$ with
$P(y_0 \le y \le y_1) = 1$. Then

$$\Phi^{-1}(F_{y_1|x}(\alpha|x)) \le f(\alpha,x) \le \Phi^{-1}(F_{y_0|x}(\alpha|x)).$$

Hence, the identification region $B(\alpha)$ is as in equation (2.1), with $\theta_\ell(\alpha,x) = \Phi^{-1}(F_{y_{1-\ell}|x}(\alpha|x))$,
$\ell = 0, 1$. A leading example, following Han and Hausman (1990) and Foresi and Peracchi
(1995), would include $\Phi$ as a probit or logit link function.
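
A minimal sketch of these bounds for a single cell of $x$, assuming a logit link: the empirical distribution functions of the two interval endpoints are passed through the inverse link. The function name `distribution_bounds`, the clipping constant `eps` (used only to keep the logit finite), and the simulated brackets are illustrative assumptions.

```python
import numpy as np

def logit(p):
    """Inverse of the logistic link."""
    return np.log(p / (1.0 - p))

def distribution_bounds(alpha, y0, y1, eps=1e-6):
    """Bounds on f(alpha, x) within one cell of x, using a logit link."""
    # P(y1 <= alpha) bounds F_y(alpha) from below because y <= y1.
    f_y1 = np.clip(np.mean(y1 <= alpha), eps, 1.0 - eps)
    # P(y0 <= alpha) bounds F_y(alpha) from above because y0 <= y.
    f_y0 = np.clip(np.mean(y0 <= alpha), eps, 1.0 - eps)
    return float(logit(f_y1)), float(logit(f_y0))

# Hypothetical bracketed outcome.
rng = np.random.default_rng(2)
y = rng.normal(size=5000)                      # latent outcome, unobserved in practice
y0, y1 = np.floor(y), np.floor(y) + 1.0        # observed bracket containing y

theta0, theta1 = distribution_bounds(0.3, y0, y1)
```

Note the reversal of indices: the upper endpoint `y1` delivers the lower bound and the lower endpoint `y0` delivers the upper bound, as in the expression for $\theta_\ell$ above.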

3.3. Sample Selection. Sample selection is a well known first-order concern in the empir- 
ical analysis of important economic phenomena. Examples include labor force participation 
(see, e.g., Mulligan and Rubinstein (2008)), skill composition of immigrants (see, e.g., Jasso 
and Rosenzweig (2008)), returns to education (e.g., Card (1999)), program evaluation (e.g., 
Imbens and Wooldridge (2009)), productivity estimation (e.g., Olley and Pakes (1996)), 
insurance (e.g., Einav, Finkelstein, Ryan, Schrimpf, and Cullen (2011)), and models with
occupational choice and financial intermediation (e.g., Townsend and Urzua (2009)). In Section
5 we revisit the analysis of Mulligan and Rubinstein (2008) who confront selection in the
context of female labor supply. 

Consider a standard sample selection model. We are interested in the behavior of $y$
conditional on $x$; however, we only observe $y$ when $u = 1$. Manski (1989) observes that the
following bound on the conditional distribution of $y$ given $x$ can be constructed:

$$F(y|x,u=1)P(u=1|x) \le F(y|x) \le F(y|x,u=1)P(u=1|x) + P(u=0|x).$$

The inverse image of these distribution bounds gives bounds on the conditional quantile
function of $y$:

$$Q_0(\alpha|x) = \begin{cases} Q_y\left(\dfrac{\alpha - P(u=0|x)}{P(u=1|x)} \,\Big|\, x, u=1\right) & \text{if } \alpha > P(u=0|x) \\ y_0 & \text{otherwise} \end{cases}$$

$$Q_1(\alpha|x) = \begin{cases} Q_y\left(\dfrac{\alpha}{P(u=1|x)} \,\Big|\, x, u=1\right) & \text{if } \alpha < P(u=1|x) \\ y_1 & \text{otherwise} \end{cases}$$

where $y_0$ is the smallest possible value that $y$ can take (possibly $-\infty$) and $y_1$ is the largest
possible value that $y$ can take (possibly $+\infty$). Thus, we obtain

$$Q_0(\alpha|x) \le Q_y(\alpha|x) \le Q_1(\alpha|x),$$

and the corresponding set of coefficients of linear approximations to $Q_y(\alpha|x)$ is as in
equation (2.1), with $\theta_\ell(\alpha,x) = Q_\ell(\alpha|x)$, $\ell = 0, 1$.

3.3.1. Alternative Form for the Bounds. As written above, the expressions for $Q_0(\alpha|x)$ and
$Q_1(\alpha|x)$ involve the propensity score, $P(u|x)$, and several different conditional quantiles of
$y|u=1$. Estimating these objects might be computationally intensive. However, we show
that $Q_0$ and $Q_1$ can also be written in terms of the $\alpha$-th conditional quantile of a different
random variable, thereby providing computational simplifications. Define

$$\tilde y_0 = y\,1\{u=1\} + y_0\,1\{u=0\}, \qquad \tilde y_1 = y\,1\{u=1\} + y_1\,1\{u=0\}. \tag{3.3}$$

Then one can easily verify that $Q_{\tilde y_0}(\alpha|x) = Q_0(\alpha|x)$ and $Q_{\tilde y_1}(\alpha|x) = Q_1(\alpha|x)$, and therefore
the bounds on the conditional quantile function can be obtained without calculating the
propensity score.
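
The construction in (3.3) is easy to sketch: unobserved outcomes are filled in with the smallest and largest possible values, and ordinary quantiles of the filled-in variables give the bounds. The example below ignores covariates and assumes an outcome with known support from 0 to 1; the function name `selection_quantile_bounds` and the simulated selection rule are hypothetical.

```python
import numpy as np

def selection_quantile_bounds(alpha, y, u, y_min, y_max):
    """Bounds on the alpha-quantile of y when y is observed only if u == 1."""
    y_lo = np.where(u == 1, y, y_min)   # fill unobserved outcomes with the smallest value
    y_hi = np.where(u == 1, y, y_max)   # fill unobserved outcomes with the largest value
    return float(np.quantile(y_lo, alpha)), float(np.quantile(y_hi, alpha))

# Hypothetical data with roughly 70 percent of outcomes observed.
rng = np.random.default_rng(3)
n = 4000
y_star = rng.uniform(size=n)                   # latent outcome with support [0, 1]
u = (rng.uniform(size=n) < 0.7).astype(int)    # selection indicator
y = np.where(u == 1, y_star, np.nan)           # outcome is missing when u == 0

q_lo, q_hi = selection_quantile_bounds(0.5, y, u, y_min=0.0, y_max=1.0)
```

No propensity score is computed at any point; up to sampling and interpolation error, the two numbers agree with the quantiles of the selected sample evaluated at the rescaled levels that appear in the propensity-score form of the bounds.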

3.3.2. Sample Selection With an Exclusion Restriction. Notice that the above bounds let
$F(y|x)$ and $P(u=1|x)$ depend on $x$ arbitrarily. However, often when facing selection
problems researchers impose exclusion restrictions. That is, the researcher assumes that
there are some components of $x$ that affect $P(u=1|x)$, but not $F(y|x)$. Availability of such
an instrument, denoted $v$, can help shrink the bounds on $Q_y(\alpha|x)$. For concreteness, we
replace $x$ with $(x,v)$ and suppose that $F(y|x,v) = F(y|x)$ $\forall v \in \mathrm{supp}(v|x)$. By the same
reasoning as above, for each $v \in \mathrm{supp}(v|x)$ we have the following bounds on the conditional
distribution function:

$$F(y|x,v,u=1)P(u=1|x,v) \le F(y|x) \le F(y|x,v,u=1)P(u=1|x,v) + P(u=0|x,v).$$

This implies that the conditional quantile function satisfies:

$$Q_0(\alpha|x,v) \le Q_y(\alpha|x) \le Q_1(\alpha|x,v) \quad \forall v \in \mathrm{supp}(v|x),$$

and therefore

$$\sup_{v \in \mathrm{supp}(v|x)} Q_0(\alpha|x,v) \le Q_y(\alpha|x) \le \inf_{v \in \mathrm{supp}(v|x)} Q_1(\alpha|x,v),$$

where $Q_\ell(\alpha|x,v)$, $\ell = 0, 1$, are defined similarly to the previous section with $x$ replaced
by $(x,v)$. Once again, we can avoid computing the propensity score by constructing the
variables $\tilde y_\ell$, $\ell = 0, 1$, as in equation (3.3). Then $Q_{\tilde y_\ell}(\alpha|x,v) = Q_\ell(\alpha|x,v)$. Inspecting the
formulae for $Q_\ell(\alpha|x,v)$, $\ell = 0, 1$, reveals that $Q_0(\alpha|x,v)$ can only be greater than $y_0$ when






CHANDRASEKHAR, CHERNOZHUKOV, MOLINARI, AND SCHRIMPF 



$1 - P(u=1|x,v) < \alpha$, and $Q_1(\alpha|x,v)$ can only be smaller than $y_1$ when $\alpha < P(u=1|x,v)$.
Thus, both bounds are informative only when

$$1 - P(u=1|x,v) < \alpha < P(u=1|x,v).$$

From this we see that the bounds are more informative for central quantiles than extreme
ones. Also, the greater the probability of being selected conditional on $x$, the more
informative the bounds are. If $P(u=1|x,v) = 1$ for some $v \in \mathrm{supp}(v|x)$ then $Q_y(\alpha|x)$ is
point-identified. This is the large-support condition required to show non-parametric
identification in selection models. If $P(u=1|x,v) < 1/2$ the upper and lower bounds cannot
both be informative.

It is important to note that $Q_\ell(\alpha|x,v)$, $\ell = 0, 1$, depend on the quantiles of $y$ conditional
on both $x$ and $v$. Moreover, $Q_y(\alpha|x,v,u=1)$ is generally not linear in $x$ and $v$, even in the
special cases where $Q_y(\alpha|x)$ is linear in $x$. Therefore, in practice, it is important to estimate
$Q_y(\alpha|x,v,u=1)$ flexibly. Accordingly, our asymptotic results allow for series estimation of
the bounding functions.

3.4. Average, Quantile, and Distribution Treatment Effects. Researchers are often
interested in mean, quantile, and distributional treatment effects. Our framework easily
accommodates these examples. Let $y_i^C$ denote the outcome for person $i$ if she does not
receive treatment, and $y_i^T$ denote the outcome for person $i$ if she receives treatment. The
methods discussed in the preceding section yield bounds on the conditional quantiles of these
outcomes. In turn, these bounds can be used to obtain bounds on the quantile treatment
effect as follows:



Note that inequalities (3.4) and (3.5) cannot both hold. Thus, we cannot obtain informative
bounds on the quantile treatment effect without an exclusion restriction. Moreover, to have
an informative upper and lower bound on $Q_{y^T}(\alpha|x) - Q_{y^C}(\alpha|x)$, the excluded variables, $v$,
must shift the probability of treatment, $P(u=1|x,v)$, sufficiently for both (3.4) and (3.5)
to hold at $x$ (for different values of $v$).

Analogous bounds apply for the distribution treatment effect and the mean treatment
effect.$^{5}$

5 Interval regressors can also be accommodated, by merging the results in this Section with those in
Section 3.1.




4. Estimation and Inference 

4.1. Overview of the Results. This section provides a less technically demanding overview
of our results, and explains how these can be applied in practice. Throughout, we use sample
selection as our primary illustrative example. As described in Section 2, our goal is to
estimate the support function, $\sigma(q,\alpha)$. The support function provides a convenient way
to compute projections of the identified set. These can be used to report upper and lower
bounds on individual coefficients and draw two-dimensional identification regions for pairs of
coefficients. For example, the bound for the $k$th component of $\beta(\alpha)$ is $[-\sigma(-e_k,\alpha), \sigma(e_k,\alpha)]$,
where $e_k$ is the $k$th standard basis vector (a vector of all zeros, except for a one in the
$k$th position). Similarly, the bound for a linear combination of the coefficients, $q'\beta(\alpha)$,
is $[-\sigma(-q,\alpha), \sigma(q,\alpha)]$. Figure 1 provides an illustration. In this example, $\beta$ is three
dimensional. The left panel shows the entire identified set. The right panel shows the joint
identification region for $\beta_1$ and $\beta_2$. The identified intervals for $\beta_1$ and $\beta_2$ are also marked
in red on the right panel.

Suppose that $x = [1; x_1]$, with $x_1$ a scalar random variable, so $\beta(\alpha) = [\beta_0(\alpha)\ \beta_1(\alpha)]'$ is
two-dimensional. To simplify notation, let $z = x$. In most applications, $\beta_1(\alpha)$ is the primary
object of interest. Stoye (2007), Beresteanu and Molinari (2008) and Bontemps, Magnac,
and Maurin (2010) give explicit formulae for the upper and lower bound of $\beta_1(\alpha)$. Recall
that the support function is given by:

$$\sigma(q,\alpha) = q'E[xx']^{-1}E\left[x\left(\theta_1(x,\alpha)1\{q'E[xx']^{-1}x > 0\} + \theta_0(x,\alpha)1\{q'E[xx']^{-1}x < 0\}\right)\right].$$

Setting $q = (0, \pm 1)'$ yields these bounds as follows:

$$\underline{\beta}_1(\alpha) = \frac{E\left[(x_{1i} - E[x_{1i}])\left(\theta_{1i}1\{x_{1i} < E[x_{1i}]\} + \theta_{0i}1\{x_{1i} > E[x_{1i}]\}\right)\right]}{E[x_{1i}^2] - E[x_{1i}]^2},$$

$$\overline{\beta}_1(\alpha) = \frac{E\left[(x_{1i} - E[x_{1i}])\left(\theta_{1i}1\{x_{1i} > E[x_{1i}]\} + \theta_{0i}1\{x_{1i} < E[x_{1i}]\}\right)\right]}{E[x_{1i}^2] - E[x_{1i}]^2},$$

where $\theta_{0i} = \theta_0(x_i,\alpha)$ and $\theta_{1i} = \theta_1(x_i,\alpha)$.
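
The closed-form slope bounds translate directly into sample analogs. The sketch below assumes the bounding functions have already been evaluated at each observation; the function name `slope_bounds` and the simulated band of width one around a linear function are illustrative assumptions rather than anything taken from the application.

```python
import numpy as np

def slope_bounds(x1, theta0, theta1):
    """Sample analogs of the lower and upper bounds on the slope when x = [1, x1] and z = x."""
    dev = x1 - x1.mean()                          # x_1i - E[x_1i]
    var = np.mean(x1 ** 2) - x1.mean() ** 2       # E[x_1i^2] - E[x_1i]^2
    # Upper bound: pair the upper band with positive deviations, the lower band with negative ones.
    upper = np.mean(dev * np.where(dev > 0, theta1, theta0)) / var
    # Lower bound: the pairing is reversed.
    lower = np.mean(dev * np.where(dev < 0, theta1, theta0)) / var
    return float(lower), float(upper)

# Hypothetical band of width one around a linear function with slope 0.5.
rng = np.random.default_rng(4)
n = 2000
x1 = rng.normal(size=n)
center = 1.0 + 0.5 * x1
theta0, theta1 = center - 0.5, center + 0.5

lower, upper = slope_bounds(x1, theta0, theta1)
```

In this simulated design the interval is centered at the slope of the band's midpoint, and its width shrinks as the band narrows relative to the dispersion of the regressor.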

4.1.1. Use of asymptotic results. We develop limit theory that allows us to derive the as- 
ymptotic distribution of the support function process and provide inferential procedures, 
as well as to establish validity of the Bayesian bootstrap. Bootstrapping is especially im- 
portant for practitioners, because of the potential complexity of the covariance functions 
involved in the limiting distributions. 

First, our limit theory shows that the support function process $S_n(t) := \sqrt{n}(\hat\sigma(t) - \sigma_0(t))$
for $t \in S^{d-1} \times \mathcal{A}$ is approximately distributed as a Gaussian process on $S^{d-1} \times \mathcal{A}$.
Specifically, we have that

$$S_n(t) = G[h_k(t)] + o_P(1)$$

in $\ell^\infty(T)$, where $k$ denotes the number of series terms in our non-parametric estimator of
$\theta_\ell(x,\alpha)$, $\ell = 0, 1$, $\ell^\infty(T)$ denotes the set of all uniformly bounded real functions on $T$, and
$h_k(t)$ denotes a stochastic process carefully defined in Section 4.3. Here, $G[h_k(t)]$ is a tight
$P$-Brownian bridge with covariance function $\Omega_k(t,t') = E[h_k(t)h_k(t')] - E[h_k(t)]E[h_k(t')]$.
By "approximately distributed" we mean that the sequence $G[h_k(t)]$ does not necessarily
converge weakly when $k \to \infty$; however, each subsequence has a further subsequence
converging to a tight Gaussian process in $\ell^\infty(T)$ with a non-degenerate covariance function.

Second, we show that inference is possible by using the quantiles of the limiting distribution
$G[h_k(t)]$. Specifically, if we have a continuous function $f$ that satisfies certain
(non-restrictive) conditions detailed in Section 4.3, and $\hat c_n(1-\tau) = c_n(1-\tau) + o_P(1)$ is
a consistent estimator of the $(1-\tau)$-quantile of $f(G[h_k(t)])$, given by $c_n(1-\tau)$, then

$$P\{f(S_n) \le \hat c_n(1-\tau)\} \to 1-\tau.$$

Finally, we consider the limiting distribution of the Bayesian bootstrap version of the
support function process, denoted $\tilde S_n(t) := \sqrt{n}(\tilde\sigma(t) - \hat\sigma(t))$, and show that, conditional on
the data, it admits an approximation

$$\tilde S_n(t) = \tilde G[h_k(t)] + o_{P^e}(1)$$

where $\tilde G[h_k(t)]$ has the same distribution as $G[h_k(t)]$ and is independent of $G[h_k(t)]$. Since
the bootstrap distribution is asymptotically close to the true distribution of interest, this
allows us to perform many standard and some less standard inferential tasks.



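^As one would expect from the definition of the support function and the properties of linear projection,

$$\underline{\beta}_1(\alpha) = \inf_{f_i \in [\theta_{0i},\theta_{1i}]} \frac{\mathrm{cov}(x_{1i}, f_i)}{\mathrm{var}(x_{1i})}, \qquad \overline{\beta}_1(\alpha) = \sup_{f_i \in [\theta_{0i},\theta_{1i}]} \frac{\mathrm{cov}(x_{1i}, f_i)}{\mathrm{var}(x_{1i})}.$$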

For example, functions yielding test statistics based on the directed Hausdorff distance and on the 
Hausdorff distance (see, e.g., Beresteanu and Molinari (2008)) belong to this class. 
Pointwise asymptotics. Suppose we want to form a confidence interval for $q'\beta(\alpha)$ for
some fixed $q$ and $\alpha$. Since our estimator converges to a Gaussian process, we know that

$$\sqrt{n}\begin{pmatrix} -\hat\sigma(-q,\alpha) + \sigma_0(-q,\alpha) \\ \hat\sigma(q,\alpha) - \sigma_0(q,\alpha) \end{pmatrix} \to_d N(0, \Omega(q,\alpha)).$$

To form a confidence interval that covers the bound on $q'\beta(\alpha)$ with probability $1-\tau$ we
can take

$$-\hat\sigma(-q,\alpha) + n^{-1/2}C_{\tau/2}(q,\alpha) \le q'\beta(\alpha) \le \hat\sigma(q,\alpha) + n^{-1/2}C_{1-\tau/2}(q,\alpha)$$

where the critical values, $C_{\tau/2}(q,\alpha)$ and $C_{1-\tau/2}(q,\alpha)$, are such that if $(x_1\ x_2)' \sim N(0, \Omega(q,\alpha))$,
then

$$P\left(x_1 \ge C_{\tau/2}(q,\alpha),\ x_2 \le C_{1-\tau/2}(q,\alpha)\right) = 1 - \tau + o_p(1).$$
If we had a consistent estimate of $\Omega(q,\alpha)$, then we could take

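$$\begin{pmatrix} C_{\tau/2}(q,\alpha) \\ C_{1-\tau/2}(q,\alpha) \end{pmatrix} = \hat\Omega^{1/2}(q,\alpha)\begin{pmatrix} -\Phi^{-1}(\sqrt{1-\tau}) \\ \Phi^{-1}(\sqrt{1-\tau}) \end{pmatrix}$$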

where $\Phi^{-1}(\cdot)$ is the inverse normal distribution function. However, the formula for $\Omega(q,\alpha)$
is complicated and it can be difficult to estimate. Therefore, we recommend and provide
theoretical justification for using a Bayesian bootstrap procedure to estimate the critical
values. See Section 4.1.3 for details.

Functional asymptotics. Since our asymptotic results show the functional convergence
of $S_n(q,\alpha)$, we can also perform inference on statistics that involve a continuum of values of
$q$ and/or $\alpha$. For example, in our application to quantile regression with selectively observed
data, we might be interested in whether a particular variable has a positive effect on the
outcome distribution. That is, we may want to test

$$H_0: 0 \in [-\sigma_0(-q,\alpha), \sigma_0(q,\alpha)] \quad \forall \alpha \in \mathcal{A},$$

with $q = e_j$. A natural family of test statistics is

$$T_n = \sqrt{n}\sup_{\alpha \in \mathcal{A}}\left( 1\{-\hat\sigma(-q,\alpha) > 0\}\,|\hat\sigma(-q,\alpha)|\,\rho(-q,\alpha) \vee 1\{\hat\sigma(q,\alpha) < 0\}\,|\hat\sigma(q,\alpha)|\,\rho(q,\alpha) \right)$$

where $\rho(q,\alpha) \ge 0$ is some weighting function which can be chosen to maximize weighted
power against some family of alternatives. There are many values of $\sigma_0(q,\alpha)$ consistent with
the null hypothesis, but the one for which it will be hardest to control size is $-\sigma_0(-q,\cdot) =
\sigma_0(q,\cdot) = 0$. In this case, we know that $S_n(t) = \sqrt{n}\hat\sigma(t)$, $t = (q,\alpha) \in S^{d-1} \times \mathcal{A}$, is well
approximated by a Gaussian process, $G[h_k(t)]$. Moreover, the quantiles of any functional
of $S_n(t)$ converge to the quantiles of the same functional applied to $G[h_k(t)]$. Thus, we
could calculate a $\tau$ critical value for $T_n$ by repeatedly simulating a realization of $G[h_k(q,\cdot)]$,
computing $T_n(G[h_k(q,\cdot)])$, and then taking the $(1-\tau)$-quantile of the simulated values of



^Instead, if one believes there is some true value, q'Po(a), in the identified set, and one wants to cover this 
true value (uniformly) with asymptotic probability 1 — r, then one can adopt the procedures of Imbens and 
Manski (2004) and Stoye (2009), see also Bontemps, Magnac, and Maurin (2010) for related applications. 
$T_n(G[h_k(q,\cdot)])$. Simulating $G[h_k(t)]$ requires estimating the covariance function. As stated
above, the formula for this function is complicated and it can be difficult to estimate.
Therefore, we recommend using the Bayesian bootstrap to compute the critical values.
Theorem 4 proves that this bootstrap procedure yields consistent inference. Section 4.1.3
gives a more detailed outline of implementing this bootstrap. Similar reasoning can be
used to test hypotheses involving a set of values of $q$ and construct confidence sets that are
uniform in $q$ and/or $\alpha$.

4.1.2. Estimation. The first step in estimating the support function is to estimate $\theta_0(x,\alpha)$
and $\theta_1(x,\alpha)$. Since economic theory often provides even less guidance about the functional
form of these bounding functions than it might about the function of interest, our asymptotic
results are written to accommodate non-parametric estimates of $\theta_0(x,\alpha)$ and $\theta_1(x,\alpha)$. In
particular, we allow for series estimators of these functions. In this section we briefly review
this approach. Parametric estimation follows as a special case where the number of series
terms is fixed. Note that while the method of series estimation described here satisfies the
conditions of Theorems 1 and 2 below, there might be other suitable methods of estimation
for the bounding functions.

In each of the examples in Section 3, except for the case of sample selection with an
exclusion restriction, series estimates of the bounding functions can be formed as follows.
Suppose there is some $y_{i\ell}$, a known function of the data for observation $i$, and a known
function $m(y, \theta(x,\alpha), \alpha)$ such that

$$\theta_\ell(\cdot,\alpha) = \arg\min_{\theta \in L^2(\mathcal{X},P)} E\left[m(y_{i\ell}, \theta(x_i,\alpha), \alpha)\right],$$

where $\mathcal{X}$ denotes the support of $x$ and $L^2(\mathcal{X},P)$ denotes the space of real-valued functions
$g$ such that $\int_{\mathcal{X}} |g(x)|^2 dP(x) < \infty$. Then we can form an estimate of the function $\theta_\ell(\cdot,\alpha)$ by
replacing it with its series expansion and taking the empirical expectation in the equation
above. That is, obtaining the coefficients

$$\hat\vartheta_\ell(\alpha) = \arg\min_{\vartheta} E_n\left[m(y_{i\ell}, p_k(x_i)'\vartheta, \alpha)\right],$$

and setting

$$\hat\theta_\ell(x_i,\alpha) = p_k(x_i)'\hat\vartheta_\ell(\alpha).$$
Here, $p_k(x_i)$ is a $k \times 1$ vector of $k$ series functions evaluated at $x_i$. These could be any set
of functions that span the space in which $\theta_\ell(x,\alpha)$ is contained. Typical examples include
polynomials, splines, and trigonometric functions, see Chen (2007). Both the properties of
$m(\cdot)$ and the choice of approximating functions affect the rate at which $k$ can grow. We
discuss this issue in more detail after stating our regularity conditions in Section 4.2.
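
To make the series step concrete, here is a minimal sketch for the special case in which $m$ is the squared loss, $m(y,\theta,\alpha) = (y-\theta)^2$, so that $\theta_\ell(x,\alpha) = E[y_{i\ell}|x]$ and the coefficients solve a least squares problem. A power-series basis for scalar $x$ is assumed; the function names and the simulated data are hypothetical and not taken from the paper.

```python
import numpy as np

def poly_basis(x, k):
    """Rows are p_k(x_i)' = (1, x_i, ..., x_i^{k-1}) for scalar x_i."""
    return np.vander(x, N=k, increasing=True)

def series_fit(y, x, k):
    """Series coefficients minimizing E_n[(y_i - p_k(x_i)' v)^2] over v."""
    coef, *_ = np.linalg.lstsq(poly_basis(x, k), y, rcond=None)
    return coef

def series_predict(coef, x):
    """Fitted bounding function p_k(x)' coef."""
    return poly_basis(x, len(coef)) @ coef

# Hypothetical data in which the lower bounding function is exactly cubic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y0 = 1.0 + 2.0 * x - x ** 3        # noise-free, so k = 4 terms recover it exactly
coef0 = series_fit(y0, x, k=4)
theta0_hat = series_predict(coef0, x)
```

For other choices of $m$ (for instance the check function in quantile regression) the least squares call would be replaced by the corresponding optimizer; the basis construction is unchanged.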

In the case of sample selection with an exclusion, one can proceed as follows. First,
estimate $Q_{y_\ell}(\alpha|x,v)$, $\ell = 0, 1$, using the method described above. Next, set $\hat\theta_\ell(x_i,\alpha) =
\min_{v \in \mathrm{supp}(v|x)} \hat Q_{y_\ell}(\alpha|x,v)$. Below we establish the validity of our results also for this case.

^This procedure yields a test with correct asymptotic size. However, it might have poor power properties
in some cases. In particular, when $\sigma_0(-q,\alpha) \neq \sigma_0(q,\alpha)$, the critical values might be too conservative.
One can improve the power properties of the test by applying the generalized moment selection procedure
proposed by Andrews and Shi (2009) to our framework.






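4.1.3. Bayesian Bootstrap. We suggest using the Bayesian Bootstrap to conduct inference.
In particular, we propose the following algorithm.

Procedure for Bayesian Bootstrap Estimation.

(1) Simulate each bootstrap draw of $\tilde\sigma(q,\alpha)$:

(a) Draw $e_i \sim \exp(1)$, $i = 1, \ldots, n$, $\bar e = E_n[e_i]$.

(b) Estimate:

$$\tilde\theta_\ell(x,\alpha) = p_k(x)'\tilde\vartheta_\ell(\alpha), \qquad \tilde\vartheta_\ell(\alpha) = \arg\min_{\vartheta} E_n\left[\frac{e_i}{\bar e}\, m(y_{i\ell}, p_k(x_i)'\vartheta, \alpha)\right],$$

$$\tilde\Sigma = \left(E_n\left[\frac{e_i}{\bar e}\, x_i z_i'\right]\right)^{-1},$$

$$\tilde w_{i,q'\tilde\Sigma} = \tilde\theta_1(x,\alpha)1(q'\tilde\Sigma z_i > 0) + \tilde\theta_0(x,\alpha)1(q'\tilde\Sigma z_i < 0),$$

$$\tilde\sigma(q,\alpha) = E_n\left[\frac{e_i}{\bar e}\, q'\tilde\Sigma z_i\, \tilde w_{i,q'\tilde\Sigma}\right].$$

(2) Denote the bootstrap draws as $\tilde\sigma^{(b)}$, $b = 1, \ldots, B$, and let $\tilde S_n^{(b)} = \sqrt{n}(\tilde\sigma^{(b)} - \hat\sigma)$. To
estimate the $1-\tau$ quantile of $T(S_n)$ use the empirical $1-\tau$ quantile of the sample
$T(\tilde S_n^{(b)})$, $b = 1, \ldots, B$.

(3) Confidence intervals for linear combinations of coefficients can be obtained as outlined
in Section 4.1.1. Inference on statistics that involve a continuum of values of
$q$ and/or $\alpha$ can be obtained as outlined in Section 4.1.1.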
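
The procedure above can be sketched in a few lines for the simplest case, mean regression with an interval-valued outcome, where $m$ is the squared loss and the weighted series step is weighted least squares. This is a minimal illustration under stated assumptions, not the authors' implementation: the simulated interval data, the quadratic basis `P`, the choice $z = x$, and all function names are hypothetical.

```python
import numpy as np

def weighted_sigma(q, x, z, y0, y1, P, weights):
    """Support function estimate with observation weights e_i / mean(e)."""
    root_w = np.sqrt(weights)

    def fitted(y):
        # Weighted least squares: minimizes E_n[w_i (y_i - p_i' v)^2] over v.
        coef, *_ = np.linalg.lstsq(P * root_w[:, None], y * root_w, rcond=None)
        return P @ coef

    theta0, theta1 = fitted(y0), fitted(y1)
    n = len(weights)
    Sigma = np.linalg.inv((x * weights[:, None]).T @ z / n)   # (E_n[w_i x_i z_i'])^{-1}
    s = z @ Sigma.T @ q                                       # q' Sigma z_i
    w_i = np.where(s > 0, theta1, theta0)
    return float(np.mean(weights * s * w_i))

def bayesian_bootstrap(q, x, z, y0, y1, P, B=200, seed=0):
    """Returns sigma_hat and the B draws of sqrt(n) * (sigma_tilde - sigma_hat)."""
    rng = np.random.default_rng(seed)
    n = len(y0)
    sigma_hat = weighted_sigma(q, x, z, y0, y1, P, np.ones(n))
    draws = np.empty(B)
    for b in range(B):
        e = rng.exponential(1.0, size=n)                      # step (a)
        draws[b] = weighted_sigma(q, x, z, y0, y1, P, e / e.mean())   # step (b)
    return sigma_hat, np.sqrt(n) * (draws - sigma_hat)        # step (2)

# Hypothetical interval data: y is only known to lie in [floor(y), floor(y) + 1].
rng = np.random.default_rng(1)
n = 400
x1 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(scale=0.5, size=n)
y0 = np.floor(y)
y1 = y0 + 1.0
x = np.column_stack([np.ones(n), x1])
P = np.vander(x1, N=3, increasing=True)
q = np.array([0.0, 1.0])

sigma_hat, s_tilde = bayesian_bootstrap(q, x, x, y0, y1, P)
crit = np.quantile(s_tilde, 0.95)   # empirical quantile used as a critical value
```

Here the statistic $T$ is simply evaluation at one fixed $(q,\alpha)$; for a functional statistic one would store the draws on a grid of $(q,\alpha)$ values and apply $T$ to each draw before taking the quantile.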

4.2. Regularity Conditions. In what follows, we state the assumptions that we maintain
to obtain our main results. We then discuss these conditions, and verify them for the
examples in Section 3.

C1 (Smoothness of Covariate Distribution). The covariates $z_i$ have a sufficiently smooth
distribution, namely for some $0 < m \le 1$, we have that $P(|q'\Sigma z_i| < \delta)/\delta^m \lesssim 1$ as
$\delta \searrow 0$ uniformly in $q \in S^{d-1}$, with $d$ the dimension of $x$. The matrix $\Sigma = (E[x_i z_i'])^{-1}$ is
finite and invertible.

C2 (Linearization for the Estimator of Bounding Functions). Let $\bar\theta$ denote either the
unweighted estimator $\hat\theta$ or the weighted estimator $\tilde\theta$, and let $v_i = 1$ for the case of the
unweighted estimator, and $v_i = e_i$ for the case of the weighted estimator. We assume that
for each $\ell = 0, 1$ the estimator $\bar\theta_\ell$ admits a linearization of the form:

$$\sqrt{n}\left(\bar\theta_\ell(x,\alpha) - \theta_\ell(x,\alpha)\right) = p_k(x)'J_\ell^{-1}(\alpha)G_n[v_i p_i \varphi_{i\ell}(\alpha)] + \bar R_\ell(x,\alpha) \qquad (4.6)$$

where $p_i = p_k(x_i)$, $\sup_{\alpha \in \mathcal{A}}\|\bar R_\ell(x_i,\alpha)\|_{P_n,2} \to_P 0$, and $(x_i, z_i, \varphi_{i\ell})$ are i.i.d. random elements.

C3 (Design Conditions). The score function $\varphi_{i\ell}(\alpha)$ is mean zero conditional on $x_i, z_i$ and
has uniformly bounded fourth moment conditional on $x_i, z_i$. The score function is smooth
in mean-quartic sense: $E\left[(\varphi_{i\ell}(\alpha) - \varphi_{i\ell}(\tilde\alpha))^4 \mid x_i, z_i\right]^{1/2} \le C\|\alpha - \tilde\alpha\|^{\gamma_\varphi}$ for some constants
$C$ and $\gamma_\varphi > 0$. Matrices $J_\ell(\alpha)$ exist and are uniformly Lipschitz over $\alpha \in \mathcal{A}$, a bounded
and compact subset of $\mathbb{R}^l$, and $\sup_{\alpha \in \mathcal{A}}\|J_\ell^{-1}(\alpha)\|$ as well as the operator norms of matrices
$E[z_i z_i']$, $E[z_i p_i']$, and $E[\|p_i p_i'\|^2]$ are uniformly bounded in $k$. $E[\|z_i\|^6]$ and $E[\|x_i\|^6]$ are finite.
$E[\|\theta_\ell(x_i,\alpha)\|^6]$ is uniformly bounded in $\alpha$, and $E[|\varphi_{i\ell}(\alpha)|^4 \mid x_i, z_i]$ is uniformly bounded in $\alpha$,
$x$, and $z$. The functions $\theta_\ell(x,\alpha)$ are smooth, namely $|\theta_\ell(x,\alpha) - \theta_\ell(x,\tilde\alpha)| \le L(x)\|\alpha - \tilde\alpha\|^{\gamma_\theta}$
for some constant $\gamma_\theta > 0$ and some function $L(x)$ with $E[L(x)^4]$ bounded.

C4 (Growth Restrictions). When $k \to \infty$, $\sup_{x \in \mathcal{X}}\|p_k(x)\| \le \xi_k$ and the following growth
condition holds on the number of series terms:

$$\log^2 n \left(n^{-m/4} + \sqrt{(k/n)\cdot\log n}\cdot\max_i\|z_i\| \wedge \xi_k\right)\sqrt{\max_{i \le n} F_1^4} \to_P 0, \qquad \xi_k^2\log^2 n/n \to 0,$$

where $F_1$ is defined in Condition C5 below.

C5 (Complexity of Relevant Function Classes). The function set $\mathcal{F}_1 = \{\varphi_{i\ell}(\alpha), \alpha \in \mathcal{A}, \ell =
0, 1\}$ has a square $P$-integrable envelope $F_1$ and has a uniform covering $L^2$ entropy equivalent
to that of a VC class. The function class $\mathcal{F}_2 \supseteq \{\theta_{i\ell}(\alpha), \alpha \in \mathcal{A}, \ell = 0, 1\}$ has a square
$P$-integrable envelope $F_2$ for the case of fixed $k$ and bounded envelope $F_2$ for the case of
increasing $k$, and has a uniform covering $L^2$ entropy equivalent to that of a VC class.

4.2.1. Discussion and verification of conditions. Condition C1 requires that the covariates
$z_i$ be continuously distributed, which in turn assures that the support function is everywhere
differentiable in $q \in S^{d-1}$; see Beresteanu and Molinari (2008, Lemma A.8) and Lemma 3 in
the Appendix. With discrete covariates, the identified set has exposed faces and therefore
its support function is not differentiable in directions $q$ orthogonal to these exposed faces,
see e.g., Bontemps, Magnac, and Maurin (2010, Section 3.1). In this case, Condition C1
can be met by adding to the discrete covariates a small amount of smoothly distributed
noise. Adding noise gives "curvature" to the exposed faces, thereby guaranteeing that the
identified set intersects its supporting hyperplane in a given direction at only one point, and
that its support function is therefore differentiable, see Schneider (1993, Corollary 1.7.3). Lemma 7 in the Appendix
shows that the distance between the true identified set and the set resulting from jittered
covariates can be made arbitrarily small. Therefore, in the presence of discrete covariates
one can apply our method obtaining arbitrarily slightly conservative inference.

Assumptions C3 and C5 are common regularity conditions and they can be verified using
standard arguments. Condition C2 requires the estimates of the bounding functions to be
asymptotically linear. In addition, it requires that the number of series terms grows fast
enough for the remainder term to disappear. This requirement must be reconciled with
Condition C4, which limits the rate at which the number of series terms can increase. We
show below how to verify these two conditions in each of the examples of Section 3.

Example (Mean regression, continued). We begin with the simplest case of mean regression
with interval valued outcome data. In this case, we have $\hat\theta_\ell(\cdot,\alpha) = p_k(\cdot)'\hat\vartheta_\ell$ with $\hat\vartheta_\ell =
(P'P)^{-1}P'y_\ell$ and $P = [p_k(x_1), \ldots, p_k(x_n)]'$. Let $\vartheta_k$ be the coefficients of a projection of
$E[y_\ell|x_i]$ on $P$, or pseudo-true values, so that $\vartheta_k = (P'P)^{-1}P'E[y_\ell|x_i]$. We have the following
linearization for $\hat\theta_\ell(\cdot,\alpha)$:

$$\sqrt{n}\left(\hat\theta_\ell(x,\alpha) - \theta_\ell(x,\alpha)\right) = \sqrt{n}\,p_k(x)'(P'P)^{-1}P'(y_\ell - E[y_\ell|x]) + \sqrt{n}\left(p_k(x)'\vartheta_k - \theta_\ell(x,\alpha)\right).$$

This is in the form of (4.6) with $J_\ell(\alpha) = P'P$, $\varphi_{i\ell}(\alpha) = (y_{i\ell} - E[y_{i\ell}|x_i])$, and $R_\ell(x,\alpha) =
\sqrt{n}(p_k(x)'\vartheta_k - \theta_\ell(x,\alpha))$. The remainder term is simply approximation error. Many results
on the rate of approximation error are available in the literature. This rate depends on the
choice of approximating functions, smoothness of $\theta_\ell(x,\alpha)$, and dimension of $x$. When using
polynomials as approximating functions, if $\theta_\ell(x,\alpha) = E[y_{i\ell}|x_i]$ is $s$ times differentiable with
respect to $x$, and $x$ is $d$ dimensional, then (see e.g. Newey (1997) or Lorentz (1986))

$$\sup_x |p_k(x)'\vartheta_k - \theta_\ell(x,\alpha)| = O(k^{-s/d}).$$

In this case C2 requires that $n^{1/2}k^{-s/d} \to 0$, or that $k$ grows faster than $n^{d/(2s)}$. Assumption
C4 limits the rate at which $k$ can grow. This assumption involves $\xi_k$ and $\sup_{i,\alpha}|\varphi_i(\alpha)|$.
The behavior of these terms depends on the choice of approximating functions and some
auxiliary assumptions. With polynomials as approximating functions and the support of $x$
compact with density bounded away from zero, $\xi_k = O(k)$. If $y_{i\ell} - E[y_{i\ell}|x_i]$ has exponential
tails, then $\sup_{i,\alpha}|\varphi_i(\alpha)| = O(2(\log n)^{1/2})$. In this case, a sufficient condition to meet C4 is
that $k = o(n^{1/3}\log^{-6} n)$. Thus, we can satisfy both C2 and C4 by setting $k \propto n^\gamma$ for any $\gamma$
in the interval connecting $\frac{d}{2s}$ and $\frac{1}{3}$. Notice that as usual in semiparametric problems, we
require undersmoothing compared to the rate that minimizes mean-squared error, which is
$\gamma = \frac{d}{d+2s}$. Also, our assumption requires increasing amounts of smoothness as the dimension
of $x$ increases.
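
A small worked example may help fix the rates in the mean regression case. The sketch below simply evaluates the three exponents named in the text, $d/(2s)$, $1/3$ and $d/(d+2s)$, for given smoothness $s$ and dimension $d$; the function name is hypothetical, and the non-emptiness condition $s > 3d/2$ is a direct rearrangement of $d/(2s) < 1/3$ rather than a statement taken from the paper.

```python
from fractions import Fraction

def gamma_interval(s, d):
    """Exponents for k proportional to n^gamma in the mean regression example."""
    lower = Fraction(d, 2 * s)         # C2: k must grow faster than n^{d/(2s)}
    upper = Fraction(1, 3)             # C4 (sufficient): k = o(n^{1/3} log^{-6} n)
    mse_rate = Fraction(d, d + 2 * s)  # rate minimizing mean-squared error
    return lower, upper, mse_rate

lower, upper, mse_rate = gamma_interval(s=4, d=1)
nonempty = lower < upper               # equivalent to s > 3d/2
```

With $s = 4$ and $d = 1$ the admissible exponents are those strictly between $1/8$ and $1/3$, while the mean-squared-error rate $1/9$ falls below that range, which is the undersmoothing noted above. With $s = 1$ and $d = 1$ the lower limit is $1/2$ and no admissible exponent exists.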

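We now discuss how to satisfy assumptions C2 and C4 more generally. Recall that in our
examples, the series estimates of the bounding functions solve

$$\hat\theta_\ell(\cdot,\alpha) = \arg\min_{\theta_\ell \in L^2(\mathcal{X},P)} E_n\left[m(y_{i\ell}, \theta_\ell(x_i,\alpha), \alpha)\right]$$

or $\hat\theta_\ell(\cdot,\alpha) = p_k(\cdot)'\hat\vartheta_\ell$ with $\hat\vartheta_\ell = \arg\min_\vartheta E_n[m(y_{i\ell}, p_k(x_i)'\vartheta, \alpha)]$. As above, let $\vartheta_k$ be the
solution to $\vartheta_k = \arg\min_\vartheta E[m(y_{i\ell}, p_k(x_i)'\vartheta, \alpha)]$. We show that the linearization in C2 holds
by writing

$$\sqrt{n}\left(\hat\theta_\ell(x,\alpha) - \theta_\ell(x,\alpha)\right) = \sqrt{n}\,p_k(x)'(\hat\vartheta - \vartheta_k) + \sqrt{n}\left(p_k(x)'\vartheta_k - \theta_\ell(x,\alpha)\right). \qquad (4.7)$$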

The first term in (4.7) is estimation error. We can use the results of He and Shao (2000) to
show that

where $\psi$ denotes the derivative of $m(y_{i\ell}, p_k(x_i)'\vartheta, \alpha)$ with respect to $\vartheta$.

The second term in (4.7) is approximation error. Standard results from approximation
theory as stated in e.g. Chen (2007) or Newey (1997) give the rate at which the error from the
best $L^2$-approximation to $\theta_\ell$ disappears. When $m$ is a least squares objective function, these
results can be applied directly. In other cases, such as quantile or distribution regression,
further work must be done.

Example (Quantile regression with interval valued data, continued). The results of Belloni,
Chernozhukov, and Fernandez-Val (2011) can be used to verify our conditions for
quantile regression. In particular, Lemma 1 from Appendix B of Belloni, Chernozhukov,
and Fernandez-Val (2011) gives the rate at which the approximation error vanishes, and
Theorem 2 from Belloni, Chernozhukov, and Fernandez-Val (2011) shows that the linearization
condition (C2) holds. The conditions required for these results are as follows.
(Q.1) The data $\{(y_{i0}, y_{i1}, x_i), 1 \le i \le n\}$ are an i.i.d. sequence of real $(2+d)$-vectors.
(Q.2) For $\ell \in \{0, 1\}$, the conditional density of $y_\ell$ given $x$ is bounded above by $\bar f$,
below by $\underline{f}$, and its derivative is bounded above by $\bar f'$ uniformly in $y_\ell$ and $x$.
$f_{y_\ell|x}(Q_{y_\ell|x}(\alpha|x)|x)$ is bounded away from zero uniformly in $\alpha \in \mathcal{A}$ and $x \in \mathcal{X}$.
(Q.3) For all $k$, the eigenvalues of $E[p_i p_i']$ are uniformly bounded above and away from
zero.
(Q.4) $\xi_k = O(k^a)$. $Q_{y_\ell|x}$ is $s$ times continuously differentiable with $s > (a+1)d$. The
series functions are such that

$$\inf_\vartheta E\left[\|p_k'\vartheta - \theta_\ell(x,\alpha)\|^2\right] = O(k^{-s/d})$$

and

$$\inf_\vartheta \|p_k'\vartheta - \theta_\ell(x,\alpha)\|_\infty = O(k^{-s/d}).$$

(Q.5) $k$ is chosen such that $k^{3+6a}(\log n)^7 = o(n)$.

Condition Q.4 is satisfied by polynomials with $a = 1$ and by splines or trigonometric series
with $a = 1/2$. Under these assumptions, Lemma 1 of Appendix B from Belloni, Chernozhukov,
and Fernandez-Val (2011) shows that the approximation error satisfies

$$\sup_{x \in \mathcal{X}, \alpha \in \mathcal{A}} |p_k(x)'\vartheta_k(\alpha) - \theta_\ell(x,\alpha)| \lesssim k^{\frac{ad-s}{d}}.$$

Theorem 2 of Belloni, Chernozhukov, and Fernandez-Val (2011) then shows that C2 holds.
Condition C4 also holds because for quantile regression $\varphi_i$ is bounded, so C4 only requires
$k + (\log n)^2 = o(n)$, which is implied by Q.5.

Example (Distribution regression, continued). As described above, the estimator solves
$\hat\vartheta = \arg\min_\vartheta E_n[m(y_i, p_k(x_i)'\vartheta, \alpha)]$ with

$$m(y_i, p_k(x_i)'\vartheta, \alpha) = -1\{y < \alpha\}\log\Phi(p_k(x_i)'\vartheta) - 1\{y \ge \alpha\}\log\left(1 - \Phi(p_k(x_i)'\vartheta)\right)$$

for some known distribution function $\Phi$. We must show that estimation error, $\hat\vartheta - \vartheta_k$, can
be linearized and that the bias, $p_k(x)'\vartheta_k - \theta_\ell(x,\alpha)$, is $o(n^{-1/2})$. We first verify the conditions
of He and Shao (2000) to show that $(\hat\vartheta - \vartheta_k)$ can be linearized. Adopting their notation, in
this example we have that the derivative of $m(y_i, p_k(x_i)'\vartheta, \alpha)$ with respect to $\vartheta$ is

$$\psi(y_i, x_i, \vartheta) = -\left(\frac{1\{y < \alpha\}}{\Phi(p_k(x_i)'\vartheta)} - \frac{1\{y \ge \alpha\}}{1 - \Phi(p_k(x_i)'\vartheta)}\right)\phi(p_k(x_i)'\vartheta)\,p_k(x_i)$$

where $\phi$ is the pdf associated with $\Phi$. Because $m(y_i, p_k(x_i)'\vartheta, \alpha)$ is a smooth function of
$\vartheta$, $E_n\psi(y_i, x_i, \hat\vartheta) = 0$, and conditions C.0 and C.2 in He and Shao (2000) hold. If $\phi$ is
differentiable with a bounded derivative, then $\psi$ is Lipschitz in $\vartheta$, and we have the bound

where $\eta_i(\vartheta,\tau) = \psi(y_i, x_i, \vartheta) - \psi(y_i, x_i, \tau) - E\psi(y_i, x_i, \vartheta) + E\psi(y_i, x_i, \tau)$. If we assume

$$\max_{i \le n}\|p_k(x_i)\| = O(k^a),$$

as would be true for polynomials with $a = 1$ or splines with $a = 1/2$, and $k$ is of order less
than or equal to $n^{1/a}$ then condition C.1 in He and Shao (2000) holds. Differentiability of
$\phi$ and C3 are sufficient for C.3 in He and Shao (2000). Finally, conditions C.4 and C.5 hold
with $A(n,k) = k$ because

$$|s'\eta_i(\vartheta,\tau)|^2 \lesssim |s'p_k(x_i)|^2\,\|\tau - \vartheta\|^2,$$

and $E[|s'p_k(x_i)|^2]$ is uniformly bounded for $s \in S^k$ for all $k$ when the series functions
are orthonormal. Applying Theorem 2.2 of He and Shao (2000), we obtain the desired
linearization if $k = o((n/\log n)^{1/2})$.

The results of Hirano, Imbens, and Ridder (2003) can be used to show that the approxi- 
mation bias is sufficiently small. Lemma 1 from Hirano, Imbens, and Ridder (2003) shows 
that for the logistic distribution regression, 

IK fe -^|| =0{k- s 'W) 

$$\sup_x |\Phi(p_k(x)'\vartheta_k) - \Phi(\theta_\ell(x,\alpha))| = O(k^{-s/(2d)}\xi_k),$$

which implies that

$$\sup_{x \in \mathcal{X}, \alpha \in \mathcal{A}} |p_k(x)'\vartheta_k - \theta_\ell(x,\alpha)| = O(k^{-s/(2d)}\xi_k).$$

This result is only for the logistic link function, but it can easily be adapted for any link
function with first derivative bounded from above and second derivative bounded away from
zero. We need the approximation error to be $o(n^{-1/2})$. For this, it suffices to have

$$k^{-s/(2d)}\xi_k n^{1/2} = o(1).$$

Letting $\xi_k = O(k^a)$ as above, it suffices to have $k \propto n^\gamma$ for $\gamma > \frac{d}{s-2ad}$.

To summarize, condition C2 can be met by having $k \propto n^\gamma$ for any $\gamma \in \left(\frac{d}{s-2ad}, \frac{1}{2}\right)$.
Finally, as in the mean and quantile regression examples above, condition C4 will be met if
$\gamma < \frac{1}{1+2a}$.

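4.3. Theoretical Results. In order to state the result we define

$$h_k(t) := q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}]J_1^{-1}(\alpha)p_i\varphi_{i1}(\alpha)$$
$$\qquad + q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}]J_0^{-1}(\alpha)p_i\varphi_{i0}(\alpha)$$
$$\qquad - q'\Sigma x_i z_i'\Sigma E[z_i w_{i,q'\Sigma}(\alpha)]$$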

Theorem 1 (Limit Theory for Support Function Process). The support function process
$S_n(t) = \sqrt{n}(\hat\sigma_{\hat\theta,\hat\Sigma} - \sigma_{\theta,\Sigma})(t)$, where $t = (q,\alpha) \in S^{d-1} \times \mathcal{A}$, admits the approximation
$S_n(t) = G_n[h_k(t)] + o_P(1)$ in $\ell^\infty(T)$. Moreover, the support function process admits an
approximation

$$S_n(t) = G[h_k(t)] + o_P(1) \text{ in } \ell^\infty(T),$$
where the process $G[h_k(t)]$ is a tight $P$-Brownian bridge in $\ell^\infty(T)$ with covariance function
$\Omega_k(t,t') = E[h_k(t)h_k(t')] - E[h_k(t)]E[h_k(t')]$ that is uniformly Hölder on $T \times T$ uniformly
in $k$, and is uniformly non-degenerate in $k$. These bridges are stochastically equicontinuous
with respect to the $L^2$ pseudo-metric $\rho_2(t,t') = [E[h_k(t) - h_k(t')]^2]^{1/2} \lesssim \|t - t'\|^c$ for some
$c > 0$ uniformly in $k$. The sequence $G[h_k(t)]$ does not necessarily converge weakly under
$k \to \infty$; however, each subsequence has a further subsequence converging to a
tight Gaussian process in $\ell^\infty(T)$ with a non-degenerate covariance function. Furthermore,
the canonical distance between the law of the support function process $S_n(t)$ and the law of
$G[h_k(t)]$ in $\ell^\infty(T)$ approaches zero, namely $\sup_{g \in BL_1(\ell^\infty(T),[0,1])} |E[g(S_n)] - E[g(G[h_k])]| \to 0$.

Next we consider the behavior of various statistics based on the support function process.
Formally, we consider these statistics as mappings $f : \ell^\infty(T) \to \mathbb{R}$ from the possible values
$s$ of the support function process $S_n$ to the real line. Examples include:

• a support function evaluated at $t \in T$, $f(s) = s(t)$,

• a Kolmogorov statistic, $f(s) = \sup_{t \in T_0} |s(t)|/w(t)$,

• a directed Kolmogorov statistic, $f(s) = \sup_{t \in T_0} \{-s(t)\}_+/w(t)$,

• a Cramer-Von-Mises statistic, $f(s) = \int_T s^2(t)/w(t)\,d\nu(t)$,

where $T_0$ is a subset of $T$, $w$ is a continuous and uniformly positive weighting function, and
$\nu$ is a probability measure over $T$ whose support is $T$. More generally we can consider any
continuous function $f$ such that (a) $f(Z)$ has a continuous distribution function when $Z$ is a
tight Gaussian process with non-degenerate covariance function and (b) $f(\xi_n + c) - f(\xi_n) =
o(1)$ for any $c = o(1)$ and any $\|\xi_n\| = O(1)$. Denote the class of such functions $\mathcal{F}_c$ and note
that the examples mentioned above belong to this class by the results of Davydov, Lifshits,
and Smorodina (1998).
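
The example statistics listed above are straightforward to evaluate once the process is stored on a finite grid of points $t$. The following sketch assumes such a discretization: `s` holds the values $s(t)$ on the grid, `w` the weights $w(t)$, and `nu` the probability masses that $\nu$ assigns to the grid points. The function names and the three-point example are hypothetical.

```python
import numpy as np

def kolmogorov(s, w):
    """sup_t |s(t)| / w(t), with the supremum taken over the grid."""
    return float(np.max(np.abs(s) / w))

def directed_kolmogorov(s, w):
    """sup_t {-s(t)}_+ / w(t), where {a}_+ = max(a, 0)."""
    return float(np.max(np.maximum(-s, 0.0) / w))

def cramer_von_mises(s, w, nu):
    """Integral of s(t)^2 / w(t) against nu, as a weighted sum over the grid."""
    return float(np.sum(s ** 2 / w * nu))

# Three-point grid with unit weights and uniform nu.
s = np.array([1.0, -2.0, 0.5])
w = np.ones(3)
nu = np.full(3, 1.0 / 3.0)
ks = kolmogorov(s, w)
dks = directed_kolmogorov(s, w)
cvm = cramer_von_mises(s, w, nu)
```

Applied to each bootstrap draw of the process, any of these functions yields a sample whose empirical quantile can serve as a critical value.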

Theorem 2 (Limit Inference on Support Function Process). The canonical
distance between the law of the support function process $S_n(t)$ and the law of $G[h_k(t)]$ in
$\ell^\infty(T)$ approaches zero, namely $\sup_{g \in BL_1(\ell^\infty(T),[0,1])} |E[g(S_n)] - E[g(G[h_k])]| \to 0$. For any
$\hat c_n = c_n + o_P(1)$ and $c_n = O_P(1)$ and $f \in \mathcal{F}_c$ we have

$$P\{f(S_n) \le \hat c_n\} - P\{f(G[h_k]) \le c_n\} \to 0.$$

If $c_n(1-\tau)$ is the $(1-\tau)$-quantile of $f(G[h_k])$ and $\hat c_n(1-\tau) = c_n(1-\tau) + o_P(1)$ is any
consistent estimate of this quantile, then

$$P\{f(S_n) \le \hat c_n(1-\tau)\} \to 1-\tau.$$

Let $e_i^o = e_i - 1$, $h_k^o = h_k - E[h_k]$, and let $P^e$ denote the probability measure conditional
on the data.

Theorem 3 (Limit Theory for the Bootstrap Support Function Process). The bootstrap
support function process $\tilde S_n(t) = \sqrt{n}(\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \hat\sigma_{\hat\theta,\hat\Sigma})(t)$, where $t = (q,\alpha) \in S^{d-1} \times \mathcal{A}$, admits
the following approximation conditional on the data: $\tilde S_n(t) = G_n[e_i^o h_k^o(t)] + o_{P^e}(1)$ in $\ell^\infty(T)$
in probability $P$. Moreover, the bootstrap support function process admits an approximation
conditional on the data:

$$\tilde S_n(t) = \tilde G[h_k(t)] + o_{P^e}(1) \text{ in } \ell^\infty(T), \text{ in probability } P,$$

where $\tilde G[h_k]$ is a sequence of tight $P$-Brownian bridges in $\ell^\infty(T)$ with the same distributions
as the processes $G[h_k]$ defined in Theorem 1, and independent of $G[h_k]$. Furthermore,
the canonical distance between the law of the bootstrap support function process
^Observe that test statistics based on the (directed) Hausdorff distance (see, e.g., Beresteanu and
Molinari (2008)) are special cases of the (directed) Kolmogorov statistics above.




$\tilde S_n(t)$ conditional on the data and the law of $G[h_k(t)]$ in $\ell^\infty(T)$ approaches zero, namely

$$\sup_{g \in BL_1(\ell^\infty(T),[0,1])} |E_{P^e}[g(\tilde S_n)] - E[g(G[h_k])]| \to_P 0.$$

Theorem 4 (Bootstrap Inference on the Support Function Process). For any $c_n = O_P(1)$
and $f \in \mathcal{F}_c$ we have

$$P\{f(S_n) \le c_n\} - P^e\{f(\tilde S_n) \le c_n\} \to_P 0.$$
In particular, if $\tilde c_n(1-\tau)$ is the $(1-\tau)$-quantile of $f(\tilde S_n)$ under $P^e$, then

$$P\{f(S_n) \le \tilde c_n(1-\tau)\} \to 1-\tau.$$

5. Application: the gender wage gap and selection 

An important question in labor economics is whether the gender wage gap is shrink- 
ing over time. Blau and Kahn (1997) and Card and DiNardo (2002), among others, have 
noted the coincidence between a rise in within-gender inequality and a fall in the gender 
wage gap over the last 40 years. Mulligan and Rubinstein (2008) observe that the growing 
wage inequality within gender should induce females to invest more in productivity. In 
turn, able females should differentially be pulled into the workforce. Motivated by this 
observation, they use Heckman's two-step estimator on repeated Current Population Sur- 
vey cross-sections in order to compute relative wages for women since 1970, holding skill 
composition constant. They find that in the 1970s selection into the female workforce was 
negative, while in the 1990s it was positive. Moreover, they argue that the majority of the
reduction in the gender gap can be attributed to the changes in the female workforce
composition. In particular, the OLS estimate of the log-wage gap has fallen from -0.419 in the
1970s to -0.256 in the 1990s, though the Heckman two-step estimates suggest that once one
controls for skill composition, the wage gap is -0.379 in the 1970s and -0.358 in the 1990s.
Based on these results, Mulligan and Rubinstein (2008) conclude that the wage gap has not 
shrunk over the last 40 years. Rather, the behavior of the OLS estimates can be explained 
by a switch from negative to positive selection into female labor force participation. 

In what follows, we address the same question as Mulligan and Rubinstein (2008), but
use our method to estimate bounds on the quantile gender wage gap without assuming a
parametric form of selection or a strong exclusion restriction. We follow their approach of
comparing conditional quantiles that ignore the selection effect, with the bounds on these
quantiles that one obtains when taking selection into account.

Our results show that we are unable to reject that the gender wage gap declined over the 
period in question. This suggests that the instruments may not be sufficiently strong to 
yield tight bounds and that there may not be enough information in the data to conclude 
that the gender gap has or has not declined from 1975 to 1999 without strong functional 
form assumptions. 

5.1. Setup. The Mulligan and Rubinstein (2008) setup relates log-wage to covariates in a 
linear model as follows: 

$$\log w = x'\beta + \varepsilon,$$



We use the same data, the same variables and the same instruments as in their paper. 
Table 1. Gender wage gap estimates

             OLS       2-step    QR(0.5)   Low       High
1975-1979    -0.408    -0.360    -0.522    -1.242    0.588
             (0.003)   (0.013)   (0.003)   (0.016)   (0.061)
1995-1999    -0.268    -0.379    -0.355    -0.623    0.014
             (0.003)   (0.013)   (0.003)   (0.012)   (0.010)

This table shows estimates of the gender wage gap (female - male) conditional on having average
characteristics. The first column shows OLS estimates of the average gender gap. The second column shows
Heckman two-step estimates. The third column shows quantile regression estimates of the median gender
wage gap. The fourth and fifth columns show estimates of bounds on the median wage gap that account
for selection. Standard errors are shown in parentheses. The standard errors were calculated using the
reweighting bootstrap described above.

wherein $x$ includes marital status, years of education, potential experience, potential
experience squared, and region dummies, as well as their interactions with an indicator for
gender which takes the value 1 if the individual is female, and zero otherwise. They model
selection as in the following equation:

$$u = 1\{z'\gamma > 0\},$$

where $z = [x\ \tilde z]$ and $\tilde z$ is marital status interacted with indicators for having zero, one, two,
or more than two children.

For each quantile, we estimate bounds for the gender wage gap utilizing our method.
The bound equations we use are given by $\theta_\ell(x,v,\alpha) = p_k(x,v)'\vartheta_\ell(\alpha)$, where $p_k(x,v) =
[x\ v\ w]$, $v$ are indicators for the number of children, and $w$ consists of years of education
squared, potential experience cubed, education $\times$ potential experience, and $v$ interacted
with marital status. After taking the intersection of the bounds over the excluded
variables $v$, our estimated bounding functions are simply the minimum or maximum over
$v$ of $p_k(x,v)'\hat\vartheta_\ell(\alpha)$.
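
The intersection step can be sketched as follows. The assignment of the maximum to the lower bounding function and the minimum to the upper one is the usual intersection-bound logic and is an assumption here, since the text only says "minimum or maximum"; the function name and the small array of fitted values are hypothetical.

```python
import numpy as np

def intersect_over_v(lower_fits, upper_fits):
    """Inputs have shape (n_x, n_v): fitted bounds at each x for each value of v."""
    theta0 = lower_fits.max(axis=1)   # largest lower bound over v (assumed convention)
    theta1 = upper_fits.min(axis=1)   # smallest upper bound over v (assumed convention)
    return theta0, theta1

# Hypothetical fitted bounds at two covariate values and three values of v.
lower_fits = np.array([[0.1, 0.3, 0.2],
                       [1.0, 0.8, 0.9]])
upper_fits = np.array([[2.0, 1.5, 1.8],
                       [3.0, 3.2, 2.9]])
theta0, theta1 = intersect_over_v(lower_fits, upper_fits)
```

The resulting pair is the tightest band implied by the exclusion restriction at each covariate value, and it is this band that enters the support function.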

5.2. Results. Let $x_f$ be a female with average (unconditional on gender) characteristics
and $x_m$ be a male with average (unconditional on gender or year) characteristics. In what
follows, we report the predicted gender wage gap for someone with average characteristics,
$(x_f - x_m)'\beta(\alpha)$. The first two columns of Table 1 reproduce the results of Mulligan and
Rubinstein (2008). The first column shows the gender wage gap estimated by ordinary least
squares. The second column shows estimates from Heckman's two-step selection correction.
The OLS estimates show a decrease in the wage gap, while the Heckman selection estimates
show no change. The third column shows estimates of the median gender wage gap from
quantile regression. Like OLS, quantile regression shows a decrease in the gender wage
gap. The final two columns show bounds on the median gender wage gap that account for
selection. The bounds are wide, especially in the 1970s. In both periods, the bounds
preclude neither a negative nor a positive gender wage gap. The bounds let us say very little
about the change in the gender wage gap.

Figure 2 shows the estimated quantile gender wage gaps in the 1970s and 1990s. The
solid black line shows the quantile gender wage gap when selection is ignored. In both the
Figure 2. Bounds at Quantiles for full sample
Panels: 1975-1979 and 1995-1999
This figure shows the estimated quantile gender wage gap (female - male) conditional on having average
characteristics. The solid black line shows the quantile gender wage gap when selection is ignored. The 
blue and red lines with upward and downward pointing triangles show upper and lower bounds that account 
for employment selection for females. The dashed lines represent a uniform 90% confidence region for the 
bounds. 



1970s and 1990s, the gender wage gap is larger for lower quantiles. At all quantiles the gap 
in the 1990s is about 40% smaller than in the 1970s. However, this result should
be interpreted with caution because it ignores selection into the labor force. 

The blue line with downward pointing triangles and the red line with upward pointing 
triangles show our estimated bounds on the gender wage gap after accounting for selection. 
The dashed lines represent a uniform 90% confidence region. In both the 1970s and 1990s, 
the upper bound lies below zero for low quantiles. This means that the low quantiles of the 
distribution of wages conditional on having average characteristics are lower for a woman 
than for a man. This difference exists even if we allow for the most extreme form of selection 
(subject to our exclusion restriction) into the labor force for women. For quantiles at or 
above the median, our estimated upper bound lies above zero and our lower bound lies below 
zero. Thus, high quantiles of the distribution of wages conditional on average characteristics 
could be either higher or lower for women than for men, depending on the true pattern of 
selection. For all quantiles, there is considerable overlap between the bounded region in the 
1970s and in the 1990s. Therefore, we can essentially say nothing about the change in the 
gender wage gap. It may have decreased, as suggested by least squares or quantile regression
estimates that ignore selection. The gap may also have stayed the same, as suggested by 
Heckman selection estimates. In fact, we cannot even rule out the possibility that the gap 
increased. 

The bounds in figure [2] are tighter in the 1990s than in the 1970s. This reflects higher 
female labor force participation in the 1990s. To find even tighter bounds, we can repeat the 
estimation focusing only on subgroups with higher labor force attachment. Figures [3]-[6] show
the estimated quantile gender wage gap conditional on being in certain subgroups. That is, 
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FIGURE 3. Quantile bounds for singles 
1975-1979 1995-1999 
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This figure shows the estimated quantile gender wage gap (female — male) conditional on being single with 
average characteristics. The solid black line shows the quantile gender wage gap when selection is ignored. 
The blue and red lines with upward and downward pointing triangles show upper and lower bounds that 
account for employment selection for females. The dashed lines represent a uniform 90% confidence region 
for the bounds. 



rather than reporting the gender wage gap for someone with average characteristics, these 
figures show the gender wage gap for someone with average subgroup characteristics (e.g., 
unconditional on gender or year, conditional on subgroup: marital status and education
level). To generate these figures, the entire model was re-estimated using only observations 
within each subgroup. 

Figures [3] and [4] show the results for singles and people with at least 16 years of education.
The results are broadly similar to the results for the full sample. There is robust evidence of a 
gap at low quantiles, although it is only marginally significant for the highly educated in the 
1990s. As expected, the bounds are tighter than the full sample bounds. Nonetheless, little 
can be said about the gap at higher quantiles or the change in the gap. For comparison, 
figure [5] shows the results for people with no more than a high school education. These 
bounds are slightly wider than the full sample bounds, but otherwise very similar. Figure 
[6] shows results for singles with at least a college degree. These bounds are tighter than
all others, but still do not allow us to say anything about the change in the gender wage 
gap. Also, there is no longer robust evidence of a gap at low quantiles. A gap is possible, 
but we cannot reject the null hypothesis of zero gap at all quantiles at the 10% level.

Figure 4. Quantile bounds for ≥ 16 years of education
1975-1979 1995-1999 
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This figure shows the estimated quantile gender wage gap (female — male) conditional on having at least 
16 years of education with average characteristics. The solid black line shows the quantile gender wage gap 
when selection is ignored. The blue and red lines with upward and downward pointing triangles show upper 
and lower bounds that account for employment selection for females. The dashed lines represent a uniform 
90% confidence region for the bounds. 




This figure shows the estimated quantile gender wage gap (female — male) conditional on having 12 or fewer 
years of education with average characteristics. The solid black line shows the quantile gender wage gap 
when selection is ignored. The blue and red lines with upward and downward pointing triangles show upper 
and lower bounds that account for employment selection for females. The dashed lines represent a uniform 
90% confidence region for the bounds. 



CHANDRASEKHAR, CHERNOZHUKOV, MOLINARI, AND SCHRIMPF 



Figure 6. Quantile bounds for ≥ 16 years of education and single
1975-1979 1995-1999 
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This figure shows the estimated quantile gender wage gap (female — male) conditional on being single with 
at least 16 years of education and average characteristics. The solid black line shows the quantile gender 
wage gap when selection is ignored. The blue and red lines with upward and downward pointing triangles 
show upper and lower bounds that account for employment selection for females. The dashed lines represent 
a uniform 90% confidence region for the bounds. 
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5.2.1. With restrictions on selection. Blundell, Gosling, Ichimura, and Meghir (2007) study 
changes in the distribution of wages in the UK. Like us, they allow for selection by estimating
quantile bounds. Also like us, Blundell, Gosling, Ichimura, and Meghir (2007) find that the
estimated bounds are quite wide. As a result, they explore various restrictions to tighten
the bounds. One restriction is to assume that the distribution of wages of the employed
stochastically dominates the distribution of wages of those not working. This implies that the
observed quantiles of wages conditional on employment are an upper bound on the quantiles of
wages not conditional on employment.
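The effect of this restriction on the upper bound can be sketched numerically. The sketch below assumes the usual worst-case form of the upper quantile bound when selection is unrestricted; the function names, the employment probability, and the artificial wage sample are hypothetical and serve only as an illustration.

```python
import numpy as np

def _quantile(sample, a):
    # Clip to [0, 1] to guard against floating point overshoot.
    return float(np.quantile(sample, min(max(a, 0.0), 1.0)))

def worst_case_upper(alpha, wages_employed, p_employed, y_max):
    """Upper quantile bound with unrestricted selection (assumed form)."""
    if alpha <= p_employed:
        return _quantile(wages_employed, alpha / p_employed)
    return y_max

def dominance_upper(alpha, wages_employed):
    """Upper bound under stochastic dominance: the observed quantile."""
    return _quantile(wages_employed, alpha)

# Artificial sample of observed wages; its alpha-quantile is 100 * alpha.
wages = np.arange(101.0)
wc = worst_case_upper(0.4, wages, p_employed=0.8, y_max=100.0)
sd = dominance_upper(0.4, wages)
```

In this example the stochastic dominance bound is the 0.4 quantile of observed wages, while the worst-case bound is the 0.5 quantile, so the restriction tightens the upper bound.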

Figure [7] shows results imposing stochastic dominance for the full sample and for highly 
educated singles. Stochastic dominance implies that the upper bound coincides with the 
quantile regression estimate. With stochastic dominance there is robust evidence of a gender 
wage gap at all quantiles in both the 1970s and 1990s, for both the full sample and the 
single highly educated subsample. The bounds with stochastic dominance are much tighter 
than without. In fact, it appears that they may be tight enough to say something about the 
change in the gender wage gap. Accordingly, figure [8] shows the estimated bounds for the 
change in the gender wage gap. It shows results for both the full sample and the single high 
education subsample. For the full sample, the estimated bounds include zero at low and 
moderate quantiles. At the 0.6 and higher quantiles, there is significant evidence that the 
gender wage gap decreased by approximately 0.15 log dollars. For highly educated singles, 
the change in the gender wage gap is not significantly different from zero for any quantiles. 

The assumption of positive selection into employment is not innocuous. It may be violated 
if there is a strong positive correlation between potential wages and reservation wages. 
This may be the case if there is positive assortative matching in the marriage market. 
Women with high potential wages could marry men with high wages, making these high 
potential wage women less likely to work. Also, the conclusion of Mulligan and Rubinstein 
(2008) that there was a switch from adverse selection into the labor market in the 1970s to 
advantageous selection in the 1990s implies that stochastic dominance did not hold in the 
1970s. Accordingly, we also explore some weaker restrictions. Blundell, Gosling, Ichimura, 
and Meghir (2007) propose a median restriction — that the median wage offer for those 
not working is less than or equal to the median observed wage. This restriction implies the 
following bounds on the distribution of wages:

\[
F(y \mid x, u=1) P(u=1 \mid x) + 1\{y \ge Q_y(0.5 \mid x, u=1)\}\, 0.5\, P(u=0 \mid x)
\le F(y \mid x) \le F(y \mid x, u=1) P(u=1 \mid x) + P(u=0 \mid x),
\]

where $y$ is wage and $u = 1$ indicates employment. Transforming these into bounds on the
conditional quantiles yields

\[
Q_0(\alpha \mid x) \le Q_y(\alpha \mid x) \le Q_1(\alpha \mid x),
\]



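where

\[
Q_0(\alpha \mid x) =
\begin{cases}
Q_y\left( \frac{\alpha - P(u=0 \mid x)}{P(u=1 \mid x)} \,\Big|\, x, u=1 \right) & \text{if } \alpha \ge P(u=0 \mid x), \\
y_0 & \text{otherwise,}
\end{cases}
\]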
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FIGURE 7. Quantile bounds for full sample imposing stochastic dominance 
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This figure shows the estimated quantile gender wage (female — male) conditional on average characteristics. 
The solid black line shows the quantile gender wage when selection is ignored. The blue and red lines with 
upward and downward pointing triangles show upper and lower bounds that account for employment selection 
for females. The dashed lines represent a uniform 90% confidence region for the bounds. 



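and

\[
Q_1(\alpha \mid x) =
\begin{cases}
Q_y\left( \frac{\alpha}{P(u=1 \mid x)} \,\Big|\, x, u=1 \right) & \text{if } \alpha < 0.5 \text{ and } \alpha < P(u=1 \mid x), \\
Q_y\left( \frac{\alpha - 0.5 P(u=0 \mid x)}{P(u=1 \mid x)} \,\Big|\, x, u=1 \right) & \text{if } \alpha \ge 0.5 \text{ and } \alpha < \frac{1 + P(u=1 \mid x)}{2}, \\
y_1 & \text{otherwise.}
\end{cases}
\]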



As above, we can also express $Q_0(\alpha \mid x)$ and $Q_1(\alpha \mid x)$ as the $\alpha$ conditional quantiles of
$\tilde y_0$ and $\tilde y_1$, where

\[
\tilde y_0 = y 1\{u = 1\} + y_0 1\{u = 0\}
\]
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Figure 8. Quantile bounds for the change in the gender wage gap imposing 
stochastic dominance 
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This figure shows the estimated change (1990s — 1970s) in the quantile gender wage gap (female — male) 
conditional on having average characteristics. The solid black line shows the quantile gender wage gap when 
selection is ignored. The blue and red lines with upward and downward pointing triangles show upper and 
lower bounds that account for employment selection for females. The dashed lines represent a uniform 90% 
confidence region for the bounds. 



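and

\[
\tilde y_1 =
\begin{cases}
y 1\{u = 1\} + y_1 1\{u = 0\} & \text{with probability } 0.5, \\
y 1\{u = 1\} + Q_y(0.5 \mid x, u = 1) 1\{u = 0\} & \text{with probability } 0.5.
\end{cases}
\]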



We can easily generalize this median restriction by assuming the $\alpha_1$ quantile of wages
conditional on working is greater than or equal to the $\alpha_0$ quantile of wages conditional on
not working. In that case, the bounds can still be expressed as $\alpha$ conditional quantiles of
$\tilde y_0$ and $\tilde y_1$, with $\tilde y_0$ as defined above and

\[
\tilde y_1 =
\begin{cases}
y 1\{u = 1\} + y_1 1\{u = 0\} & \text{with probability } (1 - \alpha_0), \\
y 1\{u = 1\} + Q_y(\alpha_1 \mid x, u = 1) 1\{u = 0\} & \text{with probability } \alpha_0.
\end{cases}
\]

We can even impose a set of these restrictions for $(\alpha_1, \alpha_0) \in \mathcal{K} \subseteq [0,1] \times [0,1]$. Stochastic
dominance is equivalent to imposing this restriction for $\alpha_1 = \alpha_0$ for all $\alpha_1 \in [0, 1]$.
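The quantile bounds under the median restriction can be computed directly from the case formulas. The following is a minimal sketch under stated assumptions: the conditional quantile function of observed wages is replaced by an empirical quantile of a sample of employed workers, the lower bound is the worst-case bound implied by the upper bound on $F(y \mid x)$, and `y_min` and `y_max` stand for the smallest and largest possible wages. The function name and arguments are hypothetical.

```python
import numpy as np

def median_restriction_bounds(alpha, wages_employed, p1, y_min, y_max):
    """Bounds on the alpha-quantile of wages under the median restriction.

    p1 is the employment probability P(u = 1 | x); wages_employed is a
    sample of wages observed for the employed.
    """
    def q(a):
        # Empirical quantile of observed wages, clipped to [0, 1].
        return float(np.quantile(wages_employed, min(max(a, 0.0), 1.0)))

    p0 = 1.0 - p1
    # Lower bound: unaffected by the median restriction.
    lower = q((alpha - p0) / p1) if alpha >= p0 else y_min
    # Upper bound: the three cases of the formula for Q_1.
    if alpha < 0.5 and alpha < p1:
        upper = q(alpha / p1)
    elif alpha >= 0.5 and alpha < (1.0 + p1) / 2.0:
        upper = q((alpha - 0.5 * p0) / p1)
    else:
        upper = y_max
    return lower, upper

# Artificial sample whose alpha-quantile is 100 * alpha.
wages = np.arange(101.0)
bounds_low_q = median_restriction_bounds(0.3, wages, 0.8, 0.0, 100.0)
bounds_high_q = median_restriction_bounds(0.7, wages, 0.8, 0.0, 100.0)
```

Replacing the constant 0.5 by a general pair $(\alpha_1, \alpha_0)$ in the second case would give the generalized restriction; that extension is not implemented in this sketch.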

Figure [9] shows estimates of the gender wage gap with the median restriction. The results
are qualitatively similar to the results without the restriction. As without the restriction, 
we obtain robust evidence of a gender wage gap at low quantiles in both the 1970s and 
1990s, and there is substantial overlap in the bounds between the two periods, so we cannot 
say much about the change in the gender wage gap. The main difference with the median 
restriction is that there is also robust evidence of a gender gap at quantiles 0.4-0.7, as well 
as at lower quantiles. 
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Figure 9. Quantile bounds imposing the median restriction 
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This figure shows the estimated quantile gender wage (female — male) conditional on average characteristics. 
The solid black line shows the quantile gender wage when selection is ignored. The blue and red lines with 
upward and downward pointing triangles show upper and lower bounds that account for employment selection 
for females. The dashed lines represent a uniform 90% confidence region for the bounds. 



6. Conclusion 



This paper provides a novel method for inference on best linear approximations to func- 
tions which are known to lie within a band. It advances the literature by allowing for 
bounding functions that may be estimated parametrically or non-parametrically by series 
estimators, and that may carry an index. Our focus on best linear approximations is motivated
by the difficulty of working directly with the sharp identification region of the functions
of interest, especially when the analysis is conditioned upon a large number of covariates. 
By contrast, best linear approximations are tractable and easy to interpret. In particular, 
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the sharp identification region for the parameters characterizing the best linear approxi- 
mation is convex, and as such can be equivalently described via its support function. The 
support function can in turn be estimated with a plug-in method, that replaces moments 
of the data with their sample analogs, and the bounding functions with their estimators. 
We show that the support function process approximately converges to a Gaussian process. 
By "approximately" we mean that while the process may not converge weakly as the num- 
ber of series terms increases to infinity, each subsequence contains a further subsequence 
that converges weakly to a tight Gaussian process with a uniformly equicontinuous and 
non-degenerate covariance function. We establish validity of the Bayesian bootstrap for 
practical inference, and verify our regularity conditions for a large number of empirically 
relevant problems, including mean regression with interval valued outcome data and inter- 
val valued regressor data; quantile and distribution regression with interval valued data; 
sample selection problems; and mean, quantile, and distribution treatment effects. 
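As an illustration of the plug-in idea, the following minimal sketch computes a sample analog of the support function for the simplest case. It assumes mean regression with an interval valued outcome, no instruments (so that $z_i = x_i$), and interval endpoints used directly in place of estimated bounding functions, which is valid here by the law of iterated expectations. The function name `support_function` is hypothetical.

```python
import numpy as np

def support_function(q, x, y_lo, y_hi):
    """Plug-in support function in direction q for interval outcome data.

    x has shape (n, d); y_lo and y_hi are the interval endpoints.
    """
    n = x.shape[0]
    sigma_hat = np.linalg.inv(x.T @ x / n)  # sample analog of Sigma
    mu = x @ (sigma_hat.T @ q)              # q' Sigma x_i for each i
    w = np.where(mu < 0, y_lo, y_hi)        # lower endpoint if mu < 0, else upper
    return float(np.mean(mu * w))

# Intercept-only example: the identified set is [mean(y_lo), mean(y_hi)].
x = np.ones((3, 1))
y_lo = np.array([0.0, 1.0, 2.0])
y_hi = np.array([1.0, 3.0, 5.0])
upper_end = support_function(np.array([1.0]), x, y_lo, y_hi)
lower_end = -support_function(np.array([-1.0]), x, y_lo, y_hi)
```

Evaluating the function in the directions $q$ and $-q$ gives the upper and lower end points of the projection of the identification region on $q$.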




Appendix A. Notation 



$S^{d-1} := \{q \in \mathbb{R}^d : \|q\| = 1\}$;

$G_n[h(t)] := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (h_i(t) - E h(t))$;

$G[h_k], \tilde G[h_k]$ := P-Brownian bridge processes, independent of each other, and
with identical distributions;

$L^2(X,P) := \{g : X \to \mathbb{R} \text{ s.t. } \int |g(x)|^2 dP(x) < \infty\}$;

$\ell^\infty(T)$ : set of all uniformly bounded real functions on $T$;

$BL_1(\ell^\infty(T), [0,1])$ : set of real functions on $\ell^\infty(T)$ with Lipschitz norm bounded by 1;

$\lesssim$ : left side bounded by a constant times the right side;

$f^o := f - Ef$.

Appendix B. Proof of the Results 
Throughout this Appendix, we impose Conditions C1-C5.

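B.1. Proof of Theorems 1 and 2. Step 1. We can write the difference between the
estimated and true support function as the sum of three differences:

\[
\hat\sigma_{\hat\theta,\hat\Sigma} - \sigma_{\theta,\Sigma}
= (\hat\sigma_{\hat\theta,\hat\Sigma} - \hat\sigma_{\theta,\hat\Sigma})
+ (\hat\sigma_{\theta,\hat\Sigma} - \hat\sigma_{\theta,\Sigma})
+ (\hat\sigma_{\theta,\Sigma} - \sigma_{\theta,\Sigma}),
\]

where $t \in T := S^{d-1} \times \mathcal{A}$. Let $\mu := q'\Sigma$, and

\[
w_{i,\mu}(\alpha) := \theta_0(x, \alpha) 1(\mu z_i < 0) + \theta_1(x, \alpha) 1(\mu z_i > 0).
\]

We define

\[
\hat\sigma_{\theta,\hat\Sigma} := E_n[q'\hat\Sigma z_i w_{i,q'\hat\Sigma}(\alpha)]
\quad \text{and} \quad
\hat\sigma_{\theta,\Sigma} := E_n[q'\Sigma z_i w_{i,q'\Sigma}(\alpha)].
\]

By Lemma 1, uniformly in $t \in T$,

\[
\sqrt{n}(\hat\sigma_{\hat\theta,\hat\Sigma} - \hat\sigma_{\theta,\hat\Sigma})(t)
= q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}] J_1^{-1}(\alpha) G_n[p_i \varphi_{i1}(\alpha)]
+ q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}] J_0^{-1}(\alpha) G_n[p_i \varphi_{i0}(\alpha)] + o_P(1).
\]

By Lemma 2, uniformly in $t \in T$,

\[
\begin{aligned}
\sqrt{n}(\hat\sigma_{\theta,\hat\Sigma} - \hat\sigma_{\theta,\Sigma})(t)
&= \sqrt{n}\, q'(\hat\Sigma - \Sigma) E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1) \\
&= -q'\hat\Sigma\, G_n[x_i z_i']\, \Sigma E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1) \\
&= -q'\Sigma\, G_n[x_i z_i']\, \Sigma E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1).
\end{aligned}
\]

By definition,

\[
\sqrt{n}(\hat\sigma_{\theta,\Sigma} - \sigma_{\theta,\Sigma})(t) = G_n[q'\Sigma z_i w_{i,q'\Sigma}(\alpha)].
\]

Putting all the terms together, uniformly in $t \in T$,

\[
\sqrt{n}(\hat\sigma_{\hat\theta,\hat\Sigma} - \sigma_{\theta,\Sigma})(t) = G_n[h_k(t)] + o_P(1),
\]

where for $t := (q, \alpha) \in T = S^{d-1} \times \mathcal{A}$

\[
\begin{aligned}
h_k(t) :={}& q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}] J_1^{-1}(\alpha) p_i \varphi_{i1}(\alpha) \\
&+ q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}] J_0^{-1}(\alpha) p_i \varphi_{i0}(\alpha) \\
&- q'\Sigma x_i z_i' \Sigma E[z_i w_{i,q'\Sigma}(\alpha)] \\
&+ q'\Sigma z_i w_{i,q'\Sigma}(\alpha) \\
:={}& h_{1i}(t) + h_{2i}(t) + h_{3i}(t) + h_{4i}(t), \qquad \text{(B.8)}
\end{aligned}
\]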

where k indexes the number of series terms. 

Step 2. (Finite k). This case follows from $\mathcal{H} = \{h_i(t), t \in T\}$ being a Donsker class with
square-integrable envelopes. Indeed, $\mathcal{H}$ is formed as finite products and sums of VC classes
or entropically equivalent classes, so we can apply Lemma 8. The result

\[
G_n[h_k(t)] \Rightarrow G[h_k(t)] \quad \text{in } \ell^\infty(T)
\]

follows, and the assertion that

\[
G_n[h_k(t)] =_d G[h_k(t)] + o_P(1) \quad \text{in } \ell^\infty(T)
\]

follows from, e.g., the Skorohod-Dudley-Wichura construction. (The $=_d$ can be replaced
by $=$ as in Step 3, in which case $G[h_k(t)]$ is a sequence of Gaussian processes indexed by $n$
and identically distributed for each $n$.)

Step 3. (Case with growing k.) This case is considerably more difficult. The main
issue here is that the uniform covering entropy of $\mathcal{H}_\ell = \{h_{\ell i}(t), t \in T\}$, $\ell = 0, 1$, grows
without bound, albeit at a very slow rate $\log n$. The envelope $H_\ell$ of this class also grows in
general, and so we cannot rely on the usual uniform entropy-based arguments; for similar
reasons we cannot rely on the bracketing-based entropy arguments. Instead, we rely on a
strong approximation argument, using ideas in Chernozhukov, Lee, and Rosen (2009) and
Belloni and Chernozhukov (2009a), to show that $G_n[h_k(t)]$ can be approximated by a tight
sequence of Gaussian processes $G[h(t)]$, implicitly indexed by $k$, where the latter sequence is
very well-behaved. Even though it may not converge as $k \to \infty$, for every subsequence of $k$
there is a further subsequence along which the Gaussian process converges to a well-behaved
Gaussian process. The latter is sufficient for carrying out the usual inference.

Lemma 4 below establishes that

\[
G_n[h(t)] = G[h(t)] + o_P(1) \quad \text{in } \ell^\infty(T),
\]

where $G[h]$ is a sequence of P-Brownian bridges with the covariance function
$E[h(t)h(t')] - E[h(t)]E[h(t')]$. Lemma 6 below establishes that for some $0 < c < 1/2$

\[
\rho_2(h(t), h(t')) = \left( E[h(t) - h(t')]^2 \right)^{1/2} \lesssim \rho(t, t') := \|t - t'\|^c,
\]

and the function $E[h(t)h(t')] - E[h(t)]E[h(t')]$ is equi-continuous on $T \times T$ uniformly in $k$. By
assumption C3 we have that $\inf_{t \in T} \mathrm{var}[h(t)] > C > 0$, with Lemma 6 providing a sufficient
condition for this.

An immediate consequence of the above result is that we also obtain the convergence in
the bounded Lipschitz metric

\[
\sup_{g \in BL_1(\ell^\infty(T), [0,1])} \left| E[g(G_n[h])] - E[g(G[h])] \right|
\le E \sup_{t \in T} \left| G_n[h(t)] - G[h(t)] \right| \wedge 1 \to 0.
\]

Step 4. Let us recognize the fact that $h$ depends on $k$ by using the notation $h_k$ in this step
of the proof. Note that $k$ itself is implicitly indexed by $n$. Let $F_k(c) := P\{f(G[h_k]) \le c\}$
and observe that by Step 3 and $f \in \mathcal{F}_c$

\[
\begin{aligned}
\left| P\{f(S_n) \le c_n + o_P(1)\} - P\{f(G[h_k]) \le c_n\} \right|
&\le \left| P\{f(G[h_k]) \le c_n + o_P(1)\} - P\{f(G[h_k]) \le c_n\} \right| \\
&\le \delta_n(o_P(1)) \to_P 0, \quad \text{for } \delta_n(\epsilon) := \sup_{c \in \mathbb{R}} |F_k(c + \epsilon) - F_k(c)|,
\end{aligned}
\]

where the last step follows by the Extended Continuous Mapping Theorem (Theorem 18.11
in van der Vaart (2000)) provided that we can show that for any $\epsilon_n \searrow 0$, $\delta_n(\epsilon_n) \to 0$. Suppose
otherwise; then there is a subsequence along which $\delta_n(\epsilon_n) \to \delta \ne 0$. We can select a further
subsequence, say $\{n_j\}$, along which the covariance function of $G_n[h_k]$, denoted $\Omega_{nk}(t,t')$,
converges to a covariance function $\Omega_0(t,t')$ uniformly on $T \times T$. We can do so by the Arzela-
Ascoli theorem in view of the uniform equicontinuity in $k$ of the sequence of the covariance
functions $\Omega_{nk}(t,t')$ on $T \times T$. Moreover, $\inf_{t \in T} \Omega_0(t,t) > C > 0$ by our assumption on
$\Omega_{nk}(t,t')$. But along this subsequence $G[h_k]$ converges in $\ell^\infty(T)$ in probability to a tight
Gaussian process, say $Z_0$. The latter happens because $G[h_k]$ converges to $Z_0$ marginally
by Gaussianity and by $\Omega_{nk}(t,t') \to \Omega_0(t,t')$ uniformly and hence pointwise on $T \times T$, and
because $G[h_k]$ is asymptotically equicontinuous as shown in the proof of Lemma 4. Thus,
along this subsequence we have that

\[
F_k(c) \to F_0(c) = P\{f(Z_0) \le c\}, \quad \text{uniformly in } c \in \mathbb{R},
\]

because we have pointwise convergence that implies uniform convergence by Polya's theorem,
since $F_0$ is continuous by $f \in \mathcal{F}_c$ and by $\inf_{t \in T} \Omega_0(t,t) > C > 0$. This implies that
along this subsequence $\delta_{n_j}(\epsilon_{n_j}) \to 0$, which gives a contradiction.

Step 5. Finally, we observe that $c(1 - \tau) = O(1)$ holds by $\sup_{t \in T} \|G[h_k(t)]\| = O_P(1)$ as
shown in the proof of Lemma 4, and the second part of Theorem 2 follows. □

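B.2. Proof of Theorems 3 and 4. Step 1. We can write the difference between a
bootstrap and true support function as the sum of three differences:

\[
\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \sigma_{\theta,\Sigma}
= (\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \tilde\sigma_{\theta,\tilde\Sigma})
+ (\tilde\sigma_{\theta,\tilde\Sigma} - \tilde\sigma_{\theta,\Sigma})
+ (\tilde\sigma_{\theta,\Sigma} - \sigma_{\theta,\Sigma}),
\]

where for

\[
w_{i,\mu}(\alpha) := \theta_0(x, \alpha) 1(\mu z_i < 0) + \theta_1(x, \alpha) 1(\mu z_i > 0)
\]

we define

\[
\tilde\sigma_{\theta,\tilde\Sigma} := E_n[(e_i/\bar e)\, q'\tilde\Sigma z_i w_{i,q'\tilde\Sigma}(\alpha)]
\quad \text{and} \quad
\tilde\sigma_{\theta,\Sigma} := E_n[(e_i/\bar e)\, q'\Sigma z_i w_{i,q'\Sigma}(\alpha)],
\]

where $\bar e = E_n e_i \to_P 1$.

By Lemma 1, uniformly in $t \in T$,

\[
\sqrt{n}(\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \tilde\sigma_{\theta,\tilde\Sigma})(t)
= q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}] J_1^{-1}(\alpha) G_n[e_i p_i \varphi_{i1}(\alpha)]
+ q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}] J_0^{-1}(\alpha) G_n[e_i p_i \varphi_{i0}(\alpha)] + o_P(1).
\]

By Lemma 2, uniformly in $t \in T$,

\[
\begin{aligned}
\sqrt{n}(\tilde\sigma_{\theta,\tilde\Sigma} - \tilde\sigma_{\theta,\Sigma})(t)
&= \sqrt{n}\, q'(\tilde\Sigma - \Sigma) E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1) \\
&= -q'\tilde\Sigma\, G_n[(e_i/\bar e)(x_i z_i')^o]\, \Sigma E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1) \\
&= -q'\Sigma\, G_n[e_i (x_i z_i')^o]\, \Sigma E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1).
\end{aligned}
\]

By definition,

\[
\sqrt{n}(\tilde\sigma_{\theta,\Sigma} - \sigma_{\theta,\Sigma})(t)
= G_n[e_i (q'\Sigma z_i w_{i,q'\Sigma}(\alpha))^o] / \bar e
= G_n[e_i (q'\Sigma z_i w_{i,q'\Sigma}(\alpha))^o] (1 + o_P(1)).
\]

Putting all the terms together, uniformly in $t \in T$,

\[
\sqrt{n}(\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \sigma_{\theta,\Sigma})(t) = G_n[e_i h_i^o(t)] + o_P(1).
\]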

Step 2. Combining the conclusions of Theorem 1 and Step 1 above we obtain:

\[
\begin{aligned}
\tilde S_n(t) &= \sqrt{n}(\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \hat\sigma_{\hat\theta,\hat\Sigma})(t) \\
&= G_n[e_i h_i^o(t)] - G_n[h_i(t)] + o_P(1) \\
&= G_n[e_i^o h_i^o(t)] + o_P(1).
\end{aligned}
\]

Observe that the bootstrap process $G_n[e_i^o h_i^o(t)]$ has the unconditional covariance function

\[
E[h(t)h(t')] - E[h(t)]E[h(t')],
\]

which is equal to the covariance function of the original process $G_n[h_i]$. Conditional on data
the covariance function of this process is

\[
E_n[h(t)h(t')] - E_n[h(t)]E_n[h(t')].
\]

Comment B.1. Note that if a bootstrap random element $Z_n$ taking values in a normed
space $(E, \|\cdot\|)$ converges in probability $P$ unconditionally, that is $Z_n = o_P(1)$, then
$Z_n = o_{P^e}(1)$ in $L^1(P)$ sense and hence in probability $P$, where $P^e$ denotes the probability measure
conditional on the data. In other words, $Z_n$ also converges in probability conditionally on the
data. This follows because $E_P |P^e\{\|Z_n\| > \epsilon\}| = P\{\|Z_n\| > \epsilon\} \to 0$, so that
$P^e\{\|Z_n\| > \epsilon\} \to 0$ in $L^1(P)$ sense and hence in probability $P$. Similarly, if $Z_n = O_P(1)$,
then $Z_n = O_{P^e}(1)$ in probability $P$.
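The passage from convergence in $L^1(P)$ to convergence in probability can be spelled out with Markov's inequality. The display below is a sketch of that standard argument; the tolerance $\eta$ is notation introduced here for illustration only.

```latex
% For fixed \epsilon > 0 and any \eta > 0 (\eta is illustrative notation):
\[
P\bigl( P^e\{\|Z_n\| > \epsilon\} > \eta \bigr)
\le \frac{E_P\bigl[ P^e\{\|Z_n\| > \epsilon\} \bigr]}{\eta}
= \frac{P\{\|Z_n\| > \epsilon\}}{\eta} \to 0 .
\]
```

The equality uses the law of iterated expectations, since $P^e$ is the probability measure conditional on the data.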

Step 3. (Finite k). This case follows from $\mathcal{H} = \{h_i(t), t \in T\}$ being a Donsker class with
square-integrable envelopes. Indeed, $\mathcal{H}$ is formed as a Lipschitz composition of VC classes or
entropically equivalent classes. Then by the Donsker theorem for exchangeable bootstraps,
see e.g., van der Vaart and Wellner (1996), we have weak convergence conditional on the
data

\[
G_n[e_i^o h_i^o(t)] / \bar e \Rightarrow \tilde G[h(t)] \quad \text{under } P^e \text{ in } \ell^\infty(T) \text{ in probability } P,
\]

where $\tilde G[h]$ is a sequence of P-Brownian bridges independent of $G[h]$ and with the same
distribution as $G[h]$. In particular, the covariance function of $\tilde G[h]$ is $E[h(t)h(t')] - E[h(t)]E[h(t')]$.
Since $\bar e \to_{P^e} 1$, the above implies

\[
G_n[e_i^o h_i^o(t)] \Rightarrow \tilde G[h(t)] \quad \text{under } P^e \text{ in } \ell^\infty(T) \text{ in probability } P.
\]

The latter statement simply means

\[
\sup_{g \in BL_1(\ell^\infty(T), [0,1])} \left| E_{P^e}[g(G_n[e_i^o h_i^o])] - E[g(\tilde G[h])] \right| \to_P 0.
\]

This statement can be strengthened to a coupling statement as in Step 4.

Step 4. (Growing k.) By Lemma 4 below we can show that (on a suitably extended
probability space) there exists a sequence of Gaussian processes $\tilde G[h(t)]$ such that

\[
G_n[e_i^o h_i^o(t)] = \tilde G[h(t)] + o_P(1) \quad \text{in } \ell^\infty(T),
\]

which implies by Comment B.1 that

\[
G_n[e_i^o h_i^o(t)] = \tilde G[h(t)] + o_{P^e}(1) \quad \text{in } \ell^\infty(T) \text{ in probability.}
\]

Here, as above, $\tilde G[h]$ is a sequence of P-Brownian bridges independent of $G[h]$ and with the
same distribution as $G[h]$. In particular, the covariance function of $\tilde G[h]$ is $E[h(t)h(t')] -
E[h(t)]E[h(t')]$. Lemma 6 describes the properties of this covariance function, which in turn
define the properties of this Gaussian process.

An immediate consequence of the above result is the convergence in bounded Lipschitz
metric

\[
\sup_{g \in BL_1(\ell^\infty(T), [0,1])} \left| E_{P^e}[g(G_n[e_i^o h_i^o(t)])] - E_{P^e}[g(\tilde G[h])] \right|
\le E_{P^e} \sup_{t \in T} \left| G_n[e_i^o h_i^o(t)] - \tilde G[h(t)] \right| \wedge 1 \to_P 0.
\]

Note that $E_{P^e}[g(\tilde G[h])] = E_P[g(\tilde G[h])]$, since the covariance function of $\tilde G[h]$ does not depend
on the data. Therefore

\[
\sup_{g \in BL_1(\ell^\infty(T), [0,1])} \left| E_{P^e}[g(G_n[e_i^o h_i^o(t)])] - E_P[g(\tilde G[h])] \right| \to_P 0.
\]

Step 5. Let us recognize the fact that $h$ depends on $k$ by using the notation $h_k$ in
this step of the proof. Note that $k$ itself is implicitly indexed by $n$. By the previous
steps and Theorem 1 there exist $\epsilon_n \searrow 0$ such that $\pi_1 = P^e\{|f(\tilde S_n) - f(\tilde G[h_k])| > \epsilon_n\}$ and
$\pi_2 = P\{|f(S_n) - f(G[h_k])| > \epsilon_n\}$ obey $E[\pi_1] \to 0$ and $\pi_2 \to 0$. Let

\[
F(c) := P\{f(G[h_k]) \le c\} = P\{f(\tilde G[h_k]) \le c\} = P^e\{f(\tilde G[h_k]) \le c\},
\]

where the equality holds because $G[h_k]$ and $\tilde G[h_k]$ are P-Brownian bridges with the same
covariance kernel, which in the case of the bootstrap does not depend on the data.

For any $c_n$ which is a measurable function of the data,

\[
\begin{aligned}
E\left| P^e\{f(\tilde S_n) \le c_n\} - P\{f(S_n) \le c_n\} \right|
&\le E\left[ P^e\{f(\tilde G[h_k]) \le c_n + \epsilon_n\} - P\{f(G[h_k]) \le c_n - \epsilon_n\} + \pi_1 + \pi_2 \right] \\
&= E F(c_n + \epsilon_n) - E F(c_n - \epsilon_n) + o(1) \\
&\le \sup_{c \in \mathbb{R}} |F(c + \epsilon_n) - F(c - \epsilon_n)| + o(1) = o(1),
\end{aligned}
\]

where the last step follows from the proof of Theorem 1. This proves the first claim of
Theorem 4 by the Chebyshev inequality. The second claim of Theorem 4 follows similarly
to Step 5 in the proof of Theorems 1-2. □
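The bootstrap analyzed in this proof can be sketched in a few lines. The sketch assumes the simplest setting of mean regression with an interval valued outcome and no instruments ($z_i = x_i$), with interval endpoints used in place of estimated bounding functions; the weights $e_i$ are i.i.d. standard exponential and enter through $e_i/\bar e$ as in Step 1. The function names are hypothetical.

```python
import numpy as np

def weighted_support(q, x, y_lo, y_hi, weights):
    """Support function in direction q using weights e_i / e_bar."""
    w = weights / weights.mean()
    n = x.shape[0]
    sigma = np.linalg.inv((x * w[:, None]).T @ x / n)  # weighted Sigma
    mu = x @ (sigma.T @ q)                             # q' Sigma x_i
    return float(np.mean(w * mu * np.where(mu < 0, y_lo, y_hi)))

def bayesian_bootstrap(q, x, y_lo, y_hi, n_draws=200, seed=0):
    """Return the estimate and draws of sqrt(n) * (bootstrap - estimate)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    base = weighted_support(q, x, y_lo, y_hi, np.ones(n))
    draws = np.array([
        weighted_support(q, x, y_lo, y_hi, rng.exponential(size=n))
        for _ in range(n_draws)
    ])
    return base, np.sqrt(n) * (draws - base)

# Small intercept-only example with made-up interval data.
x = np.ones((3, 1))
y_lo = np.array([0.0, 1.0, 2.0])
y_hi = np.array([1.0, 3.0, 5.0])
base, draws = bayesian_bootstrap(np.array([1.0]), x, y_lo, y_hi, n_draws=50, seed=1)
```

Quantiles of the draws, for a fixed direction or for a functional over a grid of directions, would then serve as critical values; the choice of functional is left open in this sketch.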

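B.3. Main Lemmas for the Proofs of Theorems 1 and 2.
Lemma 1 (Linearization). 1. (Sample) We have that uniformly in $t \in T$

\[
\begin{aligned}
\sqrt{n}(\hat\sigma_{\hat\theta,\hat\Sigma} - \hat\sigma_{\theta,\hat\Sigma})(t)
={}& q'\hat\Sigma \sqrt{n}\, E_n\left[ z_i (\hat\theta_{1,i}(\alpha) - \theta_{1,i}(\alpha)) 1\{q'\hat\Sigma z_i > 0\} \right] \\
&+ q'\hat\Sigma \sqrt{n}\, E_n\left[ z_i (\hat\theta_{0,i}(\alpha) - \theta_{0,i}(\alpha)) 1\{q'\hat\Sigma z_i < 0\} \right] \\
={}& q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}] J_1^{-1}(\alpha) G_n[p_i \varphi_{i1}(\alpha)] \\
&+ q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}] J_0^{-1}(\alpha) G_n[p_i \varphi_{i0}(\alpha)] + o_P(1).
\end{aligned}
\]

2. (Bootstrap) We have that uniformly in $t \in T$

\[
\begin{aligned}
\sqrt{n}(\tilde\sigma_{\tilde\theta,\tilde\Sigma} - \tilde\sigma_{\theta,\tilde\Sigma})(t)
={}& q'\tilde\Sigma \sqrt{n}\, E_n\left[ (e_i/\bar e) z_i (\tilde\theta_{1,i}(\alpha) - \theta_{1,i}(\alpha)) 1\{q'\tilde\Sigma z_i > 0\} \right] \\
&+ q'\tilde\Sigma \sqrt{n}\, E_n\left[ (e_i/\bar e) z_i (\tilde\theta_{0,i}(\alpha) - \theta_{0,i}(\alpha)) 1\{q'\tilde\Sigma z_i < 0\} \right] \\
={}& q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}] J_1^{-1}(\alpha) G_n[e_i p_i \varphi_{i1}(\alpha)] \\
&+ q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}] J_0^{-1}(\alpha) G_n[e_i p_i \varphi_{i0}(\alpha)] + o_P(1).
\end{aligned}
\]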

Proof of Lemma 1. In order to cover both cases with one proof, we will use $\bar\theta$ to mean
either the unweighted estimator $\hat\theta$ or the weighted estimator $\tilde\theta$ and so on, and $v_i$ to mean
either 1 in the case of the unweighted estimator or the exponential weights $e_i$ in the case of the
weighted estimator. We also observe that $\bar\Sigma \to_P \Sigma$ by the law of large numbers and the
continuous mapping theorem.

Step 1. It will suffice to show that

\[
q'\bar\Sigma \sqrt{n}\, E_n\left[ (v_i/\bar v) z_i (\bar\theta_{1,i}(\alpha) - \theta_{1,i}(\alpha)) 1\{q'\bar\Sigma z_i > 0\} \right]
= q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i > 0\}] J_1^{-1}(\alpha) G_n[v_i p_i \varphi_{i1}(\alpha)] + o_P(1)
\]

and that

\[
q'\bar\Sigma \sqrt{n}\, E_n\left[ (v_i/\bar v) z_i (\bar\theta_{0,i}(\alpha) - \theta_{0,i}(\alpha)) 1\{q'\bar\Sigma z_i < 0\} \right]
= q'\Sigma E[z_i p_i' 1\{q'\Sigma z_i < 0\}] J_0^{-1}(\alpha) G_n[v_i p_i \varphi_{i0}(\alpha)] + o_P(1).
\]

We show the argument for the first part; the argument for the second part is identical. We
also drop the index $\ell = 1$ to ease notation. By Assumption C2 we write

^y/nEn(vi/v)zi (Si - Oi) 1 {qtzi > 0} = {q'm n [(v i z i p' i l{q'f]z i > 0}] J- 1 (a)G n [w;(a)] 

+q't-E n [u i z i R i {a)l{q'tz l > 0}]} /v 
= : (o(a) + 6(a))/(l + o P (l)). 

We have from the assumptions of the theorem 

sup 1 6(a) | < ||g'E|| • |||N|||p„,2|||HH|p„,2 ■ sup ||-Ri(a)||p n , 2 = Op(l)Op(l)o P (l) = o P (l). 

a£A a&A 

Write a(a) = c(a) + d(a), where 

c(a) := q , Y,E[z i p' i l{qY,Zi > 0}] J~ 1 (a)G„ [v iPi (pi(a)] 
d(a) := p!j~ l (a)G n [v iPi ipi(a)} 

fl' := q'mniviZip'^iq'EZi > 0}] - ^'EE^l^'E^ > 0}] (B.9) 

The claim follows after showing that sup Qg ^ M( a )| = °p(1)j which is shown in subsequent 
steps below. 

Step 2. (Special case, with k fixed). This is the parametric case, which is trivial. In 
this step we have to show sup Qg _4 \d(a)\ = o P (l). We can write 

d(a) = G n [p!f a ], f a := {f a j,j = l,...,k) , f aj := J' 1 (a)v iP ijipi(a) 

and define the function class T := {f a j,a £ A, j = 1, k}. Since k is finite, and given the 
assumptions on = {(p(a), a £ A}, application of Lemmas [8] and [H-2 (a) yields 

suplogiV(e||F||Q i2 ,^,L 2 (Q)) < log(l/e). 
Q 

and the envelope is P-square integrable. Therefore, J- is P-Donsker and 

sup |G n [/ a ]| < P 1 

and sup^^ \d(a)\ < P k\\p\\ ->- P 0. 

Step 3. (General case, with k — > oo). In this step we have to show sup a6 _4 \d(a)\ = o P (l). 
The case of k — > oo is much more difficult if we want to impose rather weak conditions on 
the number of series terms. We can write 

d(a) = G n [f an ], f an := p! J" 1 (a)vipnpi{a) 

and define the function class T% := {fan, a E A}, see equation (IB. 13|) below. By Lemma [9] 
the random entropy of this function class obeys 

log N(e\\F 3 || P „ >2 , J" 3 , L 2 (P n )) < P log n + log(l/e). 

Therefore by Lemma [TTT conditional on X n = {x^Zi,i = 1, n), for each 5 > there exists 
a constant K$, that does not depend on n, such that for all n: 

P <^ sup \d(a)\ > Ks^logn I sup ||/ Q „||p n , 2 V sup ||/ Qn ||p|x„,2 ) > < S, 

la£A \a£A a£A / ) 
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where P|X n denotes the probability measure conditional on X n . The conclusion follows if 
we can demonstrate that sup^g^ ||/an||p„,2 V sup ag _4 ||/em||p|x n ,2 ~~ ^P 0- To show this note 
that 

SUp ||/an||p n ,2 < SUp || J" 1 («) || \\& n PiP'i \\ ' SUp SUp \(fi(a)\ ->p 0, 

a£A a£A i<n i<n,a£A 

where the convergence to zero in probability follows because 

\\P-W <p n" m/4 + y 7 (k/n) ■ logn • (log n max ||^||) A£ fc , sup^ < P logn 

by Step 4 below, sup Qg _4 || J _1 (a)|| < 1 by assumption IC31 ||E n pjp^|| <p 1 by Lemma [TU1 
and 

log 2 n ( n~ m / 4 + \J (k/n) • log n ■ max \\zi\\ A ) sup |y?j(a)| — >p 
V * / i<n,aeA 

by assumption IC4L Also note that 



1 1/2 

l^nWill • K^["il) ' ■ s^P L^L^j WP'i,2tJ. 

by the preceding argument and E[92 2 (u)|xj, z$] uniformly bounded in a and i by assumption 



sup ||/an||p|x n! 2 < IIHI sup ||J 1 (o:)||||IEr 1 .PiPil| ■ (E[f 2 ]) 1/2 ■ sup \E[ipf (u)\x i , Zi\] 1 2 -> P 0, 

a£A i<n,a£A 



Step 4. In this step we show that 

WftW <P ™~ m/4 + \J(kjn) • logn • (logn max ||^||) A£ fc . 

i 

We can bound 

\\p,\\ < ||S - S||||E > 0}] || + ||S||ui + ||S||// 2 , 

where 

H! = ||E n [viZip'iiy^Zi > 0}] - E [zip'iliq'Zzi > 0}] || 
u. 2 = WEniviZip'iiliq'Zzi > 0} - l{q't Zl > 0}}}\\. 
By Lemma [TOl ||S — S|| = o P (l), and from Assumption IC3I ||E [ziPil{q'T,Zi > 0}] || < 1. 
By elementary inequalities 

U 2 2 < EnlKllXll^irilEnbiPilllEnKl^Szi > 0} - > 0}} 2 ] < P n^ m / 2 , 

where we used the Chebyshev inequality along with E||vj || 2 = 1 and E[||zj|| 2 ] < oo, ||E n [pjp^ || <p 
1 by Lemmadfll and E n [{l{q'T,Zi > 0} - l{q't, Zi > 0}} 2 ] < P n~ m / 2 by Step 5 below. 

We can write fi\ = sup g6 g |E n <7 — Eg\, where Q := {vrf ' Zip'^qY^q'YjZi > 0}, ||7|| = 1, ||?/|| = 
1}. The function class Q obeys 

suplogiV( e ||G||Q, 2 ,g,L 2 (Q)) < (dim(0 i ) + dim(p i ))log(l/e) < fclog(l/e) 
Q 

for the envelope Gi = Vi\\z\\i ■ ^ that obeys maxjlogGj <p logn by -E|uj| p < oo for any 
p > 0, E||zj|| 2 < oo and log^ <p logn. Invoking Lemma[TT]we obtain 



Ml <P V( k / n ) -togn x sup ||p||p n ,2 V sup ||fif||p,2, 
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where 

sup||g||F n ,2 ^max||zi|| maxvi ■ ||E„[pjp'J||J A ( [E^Hu^H 2 ] 1/2 £ fc ) 
< P (max^maxll^H) A £ fc < P (lognmax \\zi\\) A£ k 

it i 

by EH^II 2 < oo and by E n [p^] < P 1, max^ < P logn and sup 36g ||ff||p,2 = < 1 by 

Assumption IC3I Thus 



fix < P \J (k/n) ■ logn(max||zi|| logn) A & 



and the claim of the step follows. 
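Step 5. Here we show

\[ \sup_{q\in S^{d-1}} \mathbb{E}_n\Big[\big(1(q'\hat\Sigma z_i < 0) - 1(q'\Sigma z_i < 0)\big)^2\Big] \lesssim_P n^{-m/2}. \]

Note $\big(1(q'\hat\Sigma z_i < 0) - 1(q'\Sigma z_i < 0)\big)^2 = 1(q'\hat\Sigma z_i < 0 < q'\Sigma z_i) + 1(q'\hat\Sigma z_i > 0 > q'\Sigma z_i)$. The set

\[ \mathcal{F} = \big\{1(q'\tilde\Sigma z_i < 0 < q'\Sigma z_i) + 1(q'\tilde\Sigma z_i > 0 > q'\Sigma z_i),\ q\in S^{d-1},\ \|\Sigma\| \le M,\ \|\tilde\Sigma\| \le M\big\} \]

is P-Donsker because it is a VC class with a constant envelope. Therefore, $|\mathbb{E}_n f - E f| \lesssim_P n^{-1/2}$ uniformly on $f\in\mathcal{F}$. Hence uniformly in $q\in S^{d-1}$, $\mathbb{E}_n[(1(q'\hat\Sigma z_i < 0) - 1(q'\Sigma z_i < 0))^2]$ is equal to

\[ E\big[1(q'\hat\Sigma z_i < 0 < q'\Sigma z_i) + 1(q'\hat\Sigma z_i > 0 > q'\Sigma z_i)\big] + O_P(n^{-1/2}) \]
\[ \le P\big(|q'\Sigma z_i| < |q'(\hat\Sigma - \Sigma) z_i|\big) + O_P(n^{-1/2}) \]
\[ \lesssim \|\hat\Sigma - \Sigma\|^m + O_P(n^{-1/2}) \lesssim_P n^{-m/2} + n^{-1/2} \lesssim_P n^{-m/2}, \]

where we are using that for $0 < m \le 1$

\[ P\big(|q'\Sigma z_i| < |q'(\hat\Sigma - \Sigma) z_i|\big) \le P\big(|q'\Sigma z_i/\|z_i\|| < \|\hat\Sigma - \Sigma\|\big) \lesssim \|\hat\Sigma - \Sigma\|^m, \]

where the last inequality holds by Assumption C1, which gives that $P(|q'\Sigma z_i/\|z_i\|| < \delta)/\delta^m \lesssim 1$. □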
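The inclusion used in Step 5 can be checked numerically: the two indicators can disagree only when $|q'\Sigma z_i|$ is small relative to $\|\hat\Sigma - \Sigma\|\,\|z_i\|$. The following is a minimal Python sketch under assumptions that are not taken from the paper: $d = 2$, $\Sigma = I$, standard normal $z_i$, the direction $q = (1, 0)'$, and a hypothetical perturbation $\hat\Sigma = I + \epsilon R$ with $R$ a rotation by 90 degrees, so that $\|\hat\Sigma - \Sigma\| = \epsilon$. The function name is illustrative.

```python
import math
import random


def sign_disagreement(n=20000, eps=0.05, seed=0):
    """Return (share of draws where the two indicators differ,
    share of draws with |q'z| <= eps * ||z||)."""
    rng = random.Random(seed)
    q = (1.0, 0.0)  # hypothetical direction on the unit circle
    disagree = 0
    near_zero = 0
    for _ in range(n):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        a = q[0] * z[0] + q[1] * z[1]                  # q' Sigma z, Sigma = I
        rz = (z[1], -z[0])                             # R z, R = [[0, 1], [-1, 0]]
        b = a + eps * (q[0] * rz[0] + q[1] * rz[1])    # q' Sigma_hat z
        if (a < 0) != (b < 0):
            disagree += 1                              # indicators differ
        if abs(a) <= eps * math.hypot(z[0], z[1]):
            near_zero += 1                             # |q'Sigma z| / ||z|| <= eps
    return disagree / n, near_zero / n


if __name__ == "__main__":
    d_share, nz_share = sign_disagreement()
    print("share with differing indicators:", d_share)
    print("share with |q'z|/||z|| <= eps:  ", nz_share)
```

In this toy design the first share never exceeds the second, which mirrors the probability bound in the last display of the step.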

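Lemma 2. Let $w_{i,\mu}(\alpha) := \big(\theta_0(x,\alpha)1(\mu z_i < 0) + \theta_1(x,\alpha)1(\mu z_i > 0)\big)$. 1. (Sample) Then uniformly in $t \in T$

\[ \sqrt{n}\big(\hat\sigma_{\theta,\hat\Sigma} - \hat\sigma_{\theta,\Sigma}\big)(t) = \sqrt{n}\, q'\big(\hat\Sigma - \Sigma\big) E\big[z_i w_{i,q'\Sigma}(\alpha)\big] + o_P(1). \]

2. (Bootstrap) Then uniformly in $t \in T$

\[ \sqrt{n}\big(\tilde\sigma_{\theta,\tilde\Sigma} - \tilde\sigma_{\theta,\Sigma}\big)(t) = \sqrt{n}\, q'\big(\tilde\Sigma - \Sigma\big) E\big[z_i w_{i,q'\Sigma}(\alpha)\big] + o_P(1). \]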

Proof of Lemma 2. In order to cover both cases with one proof, we will use $\bar\theta$ to mean either the unweighted estimator $\hat\theta$ or the weighted estimator $\tilde\theta$ and so on, and $v_i$ to mean either 1 in the case of the unweighted estimator or exponential weights $e_i$ in the case of the weighted estimator. We also observe that $\bar\Sigma \to_P \Sigma$ by the law of large numbers (Lemma 10) and the continuous mapping theorem.
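The two cases treated in the proof differ only in the weights $v_i$. As a rough illustration (not the authors' code), the Python sketch below forms the weighted average $\mathbb{E}_n[(v_i/\bar v) f_i]$, with $v_i = 1$ for the sample case and i.i.d. standard exponential weights for the Bayesian bootstrap case. The function names and the toy data are hypothetical.

```python
import random


def weighted_mean(values, weights):
    """Compute E_n[(v_i / v_bar) * f_i] for weights v_i and values f_i."""
    n = len(values)
    v_bar = sum(weights) / n
    return sum((w / v_bar) * x for w, x in zip(weights, values)) / n


def bayesian_bootstrap_draws(values, n_draws=200, seed=0):
    """Draw weighted means using i.i.d. Exp(1) weights (Bayesian bootstrap)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        e = [rng.expovariate(1.0) for _ in values]
        draws.append(weighted_mean(values, e))
    return draws


if __name__ == "__main__":
    f = [1.0, 2.0, 3.0, 4.0]
    print("unweighted:", weighted_mean(f, [1.0] * len(f)))
    print("first bootstrap draws:", bayesian_bootstrap_draws(f, n_draws=3))
```

With unit weights the function returns the ordinary sample mean; with exponential weights each draw is a random convex combination of the $f_i$.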

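Step 1. Define $\mathcal{F} = \{q'\Sigma z_i w_{i,q'\Sigma}(\alpha) : t \in T, \|\Sigma\| \le C\}$. We have that for $\hat f_i(t) = q'\bar\Sigma z_i w_{i,q'\bar\Sigma}(\alpha)$ and $f_i(t) = q'\Sigma z_i w_{i,q'\Sigma}(\alpha)$, by definition

\[ \sqrt{n}\,\mathbb{E}_n[(v_i/\bar v)(\hat f_i(t) - f_i(t))] = \sqrt{n}\,E[\hat f_i(t) - f_i(t)] + \mathbb{G}_n[v_i(\hat f_i(t) - f_i(t))^o]/\bar v \]
\[ = \sqrt{n}\,E[\hat f_i(t) - f_i(t)] + \mathbb{G}_n[v_i(\hat f_i(t) - f_i(t))^o]/(1 + o_P(1)). \]

By intermediate value expansion and Lemma 3, uniformly in $\alpha \in \mathcal{A}$ and $q \in S^{d-1}$

\[ \sqrt{n}\big(E[\hat f_i(t) - f_i(t)]\big) = \sqrt{n}(q'\bar\Sigma - q'\Sigma)E[z_i w_{i,q'\Sigma^*(t)}(\alpha)] = \sqrt{n}(q'\bar\Sigma - q'\Sigma)E[z_i w_{i,q'\Sigma}(\alpha)] + o_P(1), \]

for $q'\Sigma^*(t)$ on the line connecting $q'\Sigma$ and $q'\bar\Sigma$, where the last step follows by the uniform continuity of the mapping $(\alpha, q'\Sigma) \mapsto E[z_i w_{i,q'\Sigma}(\alpha)]$ and $q'\bar\Sigma - q'\Sigma \to_P 0$. Furthermore $\sup_{t\in T}|\mathbb{G}_n[v_i(\hat f_i(t) - f_i(t))^o]| \to_P 0$ by Step 2 below, proving the claim of the Lemma.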

Step 2. It suffices to show that for any $t \in T$, we have that $\mathbb{G}_n[v_i(\hat f_i(t) - f_i(t))^o] \to_P 0$. By Lemma 19.24 from van der Vaart (2000) it follows that if $v_i(\hat f_i(t) - f_i(t))^o \in \mathcal{G} = v_i(\mathcal{F} - \mathcal{F})^o$ is such that

\[ \big(E[(v_i(\hat f_i(t) - f_i(t))^o)^2]\big)^{1/2} \le 2\big(E[(v_i(\hat f_i(t) - f_i(t)))^2]\big)^{1/2} \to_P 0, \]

and $\mathcal{G}$ is P-Donsker, then $\mathbb{G}_n[v_i(\hat f_i(t) - f_i(t))^o] \to_P 0$. Here $\mathcal{G}$ is P-Donsker because $\mathcal{F}$ is a P-Donsker class formed by taking products of $\mathcal{F}_2 \supseteq \{\theta_{\ell i}(\alpha) : \alpha \in \mathcal{A}, \ell = 0, 1\}$, which possess a square-integrable envelope, with bounded VC classes $\{1(q'\Sigma z_i > 0), q \in S^{d-1}, \|\Sigma\| \le C\}$ and $\{1(q'\Sigma z_i < 0), q \in S^{d-1}, \|\Sigma\| \le C\}$ and then summing followed by demeaning. The difference $(\mathcal{F} - \mathcal{F})^o$ is also P-Donsker, and its product with the independent square-integrable variable $v_i$ is still a P-Donsker class with a square-integrable envelope. Note that



E[/i(t) - h(t)f 



E 



\ 



< 



IS - SI 



IP,2 



(q'Z - q , T)z i Ol (a)l(q , Z , z l < 0)1 (g'Ez, < 0) 
+ (q't - q%Zi9 u {a)l (q'tz, > 0) 1 (q'E Zi > 0) 
+ (q't Zi e 0i {a) - q'ZziOaia)) 1 (q't Zl < < q'Tz, 
\ + (q'^ZiOuia) - ^ZiOoiia)) 1 (g'Sz, > > g'E^) / 

elia 



IP, 2 



max 



a6Ate{o,i} 



P,2 



+ 



|E||p V ||E" 2 



\po max ||^Ii(a)[|p,2 ■ sup P [k'EzJ < \q' (E - E) 



< 



p||S — E|| 2 + sup P [(g'E; 



< HE - E| 



1/2 



o. 



□ 



where we invoked the moment and smoothness assumptions. 

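Lemma 3 (A Uniform Derivative). Let $\sigma_{i,\mu}(\alpha) = \mu z_i\big(\theta_{0i}(\alpha)1(\mu z_i < 0) + \theta_{1i}(\alpha)1(\mu z_i > 0)\big)$. Uniformly in $\mu \in M = \{q'\Sigma : q \in S^{d-1}, \|\Sigma\| \le C\}$ and $\alpha \in \mathcal{A}$

\[ \frac{\partial E[\sigma_{i,\mu}(\alpha)]}{\partial \mu} = E[z_i w_{i,\mu}(\alpha)], \]

where the right hand side is uniformly continuous in $\mu$ and $\alpha$.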

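Proof: The continuity of the mapping $(\mu, \alpha) \mapsto E[z_i w_{i,\mu}(\alpha)]$ follows by an application of the dominated convergence theorem and stated assumptions on the envelopes.

Note that for any $\|\delta\| \to 0$

\[ \frac{E[(\mu + \delta)z_i w_{i,\mu+\delta}(\alpha)] - E[\mu z_i w_{i,\mu}(\alpha)]}{\|\delta\|} = \frac{\delta}{\|\delta\|}E[z_i w_{i,\mu}(\alpha)] + \frac{1}{\|\delta\|}E[R_i(\delta,\mu,\alpha)], \]

where

\[ R_i(\delta,\mu,\alpha) := (\mu + \delta)z_i\big(\theta_{1i}(\alpha) - \theta_{0i}(\alpha)\big)1\big(\mu z_i < 0 < (\mu + \delta)z_i\big) + (\mu + \delta)z_i\big(\theta_{0i}(\alpha) - \theta_{1i}(\alpha)\big)1\big(\mu z_i > 0 > (\mu + \delta)z_i\big). \]

By Cauchy-Schwarz and the maintained assumptions

\[ \sup_{\mu\in M,\alpha\in\mathcal{A}} E|R_i(\delta,\mu,\alpha)| \lesssim \|\delta z\|_{P,2}\cdot \sup_{\alpha\in\mathcal{A},\ell\in\{0,1\}}\|\theta_{\ell i}(\alpha)\|_{P,2}\cdot \sup_{\mu\in M}\big[P(|\mu z| < |\delta z|)\big]^{1/2} \lesssim \|\delta z\|_{P,2}\cdot 1\cdot \|\delta\|^{m/2}. \]

Therefore, as $\|\delta\| \to 0$

\[ \sup_{\mu\in M,\alpha\in\mathcal{A}} \frac{1}{\|\delta\|}\big|E[R_i(\delta,\mu,\alpha)]\big| \le \sup_{\mu\in M,\alpha\in\mathcal{A}} \frac{1}{\|\delta\|}E|R_i(\delta,\mu,\alpha)| \lesssim \|\delta\|^{m/2} \to 0. \quad □ \]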
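The derivative property can be illustrated on a sample analogue. The Python sketch below, which is only a numerical illustration under assumed toy data (bivariate standard normal $z_i$, bounds $\theta_{0i} = x_i - 1$ and $\theta_{1i} = x_i + 1$ with standard normal $x_i$; none of this comes from the paper), compares a forward finite difference of $\mu \mapsto \mathbb{E}_n[\mu z_i w_{i,\mu}]$ with the candidate gradient $\mathbb{E}_n[z_i w_{i,\mu}]$. All function names are hypothetical.

```python
import random


def make_sample(n, seed=0):
    """Toy data: (z_i, theta_0i, theta_1i) with theta_0i <= theta_1i."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        x = rng.gauss(0.0, 1.0)
        data.append((z, x - 1.0, x + 1.0))
    return data


def w(mu, z, th0, th1):
    """w_{i,mu}: lower bound if mu'z < 0, upper bound if mu'z > 0."""
    s = mu[0] * z[0] + mu[1] * z[1]
    if s > 0:
        return th1
    if s < 0:
        return th0
    return 0.0


def sigma_n(mu, data):
    """Sample analogue of E[mu'z_i * w_{i,mu}]."""
    total = sum((mu[0] * z[0] + mu[1] * z[1]) * w(mu, z, t0, t1)
                for z, t0, t1 in data)
    return total / len(data)


def gradient_n(mu, data):
    """Sample analogue of E[z_i * w_{i,mu}] (the claimed derivative)."""
    n = len(data)
    g0 = sum(z[0] * w(mu, z, t0, t1) for z, t0, t1 in data) / n
    g1 = sum(z[1] * w(mu, z, t0, t1) for z, t0, t1 in data) / n
    return g0, g1


def finite_difference(mu, data, h=1e-7):
    """Forward finite differences of sigma_n in each coordinate of mu."""
    base = sigma_n(mu, data)
    d0 = (sigma_n((mu[0] + h, mu[1]), data) - base) / h
    d1 = (sigma_n((mu[0], mu[1] + h), data) - base) / h
    return d0, d1


if __name__ == "__main__":
    sample = make_sample(2000)
    mu0 = (0.6, 0.8)
    print("gradient formula: ", gradient_n(mu0, sample))
    print("finite difference:", finite_difference(mu0, sample))
```

The remainder term $R_i$ in the proof corresponds to observations whose sign of $\mu z_i$ flips under the perturbation; for a small step these are rare and contribute little, which is why the two quantities agree closely.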



Lemma 4 (Coupling Lemma). 1. (Sample) We have that

\[ \mathbb{G}_n[h(t)] = \mathbb{G}[h(t)] + o_P(1) \text{ in } \ell^\infty(T), \]

where $\mathbb{G}$ is a P-Brownian bridge with covariance function $E[h(t)h(t')] - E[h(t)]E[h(t')]$.
2. (Bootstrap). We have that

\[ \mathbb{G}_n[e^o h^o(t)] = \mathbb{G}[h(t)] + o_P(1) \text{ in } \ell^\infty(T), \]

where $\mathbb{G}$ is a P-Brownian bridge with covariance function $E[h(t)h(t')] - E[h(t)]E[h(t')]$.

Proof. The proof can be accomplished by using a single common notation. Specifically it will suffice to show that for either the case $g_i = 1$ or $g_i = e_i - 1$

\[ \mathbb{G}_n[g h^o] = \mathbb{G}^g[h(t)] + o_P(1) \text{ in } \ell^\infty(T), \]

where $\mathbb{G}^g$ is a P-Brownian bridge with covariance function $E[h(t)h(t')] - E[h(t)]E[h(t')]$. The process $\mathbb{G}^g$ for the case of $g_i = 1$ is different from (in fact independent of) the process $\mathbb{G}^g$ for the case of $g_i = e_i - 1$, but they both have identical distributions. Once we understand this, we can drop the index $g$ for the process.

Within this proof, it will be convenient to define:

\[ S_n(t) := \mathbb{G}_n[g h^o(t)] \quad\text{and}\quad Z_n(t) := \mathbb{G}[h(t)]. \]

Let $B_{jk}$, $k = 1, \ldots, p$ be a partition of $T$ into sets of diameter at most $j^{-1}$. We need at most

\[ p \lesssim j^d, \quad d = \dim(T) \]

such partition sets. Choose $t_{jk}$ as arbitrary points in $B_{jk}$, for all $k = 1, \ldots, p$. We define the sequence of projections $\pi_j : T \to T$, $j = 0, 1, 2, \ldots, \infty$ by $\pi_j(t) = t_{jk}$ if $t \in B_{jk}$.

In what follows, given a process $Z$ in $\ell^\infty(T)$ and its projection $Z \circ \pi_j$, whose paths are constant over the partition set, we shall identify the process $Z \circ \pi_j$ with a random vector $Z \circ \pi_j$ in $\mathbb{R}^p$, when convenient. Analogously, given a random vector $Z$ in $\mathbb{R}^p$ we identify it with a process $Z$ in $\ell^\infty(T)$, whose paths are constant over the elements of the partition sets.
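To make the projection device concrete, here is a small Python sketch. It assumes, purely for illustration, that $T = [0,1]^d$ and that the partition consists of the $j^d$ grid cells of side $1/j$ (diameter $1/j$ in the sup norm), with cell centres as the representative points; the paper allows arbitrary partitions and representatives. The function names are hypothetical.

```python
import itertools


def project(t, j):
    """Map a point t in [0, 1]^d to the centre of its grid cell of side 1/j."""
    cell = [min(int(c * j), j - 1) for c in t]   # cell index in each coordinate
    return tuple((k + 0.5) / j for k in cell)


def representatives(d, j):
    """All j**d representative points t_{jk} of the grid partition."""
    return [tuple((k + 0.5) / j for k in idx)
            for idx in itertools.product(range(j), repeat=d)]


if __name__ == "__main__":
    print(project((0.3, 0.7), 4))
    print(len(representatives(2, 4)))
```

A process evaluated at `project(t, j)` is constant on each cell, so it can be identified with a vector indexed by `representatives(d, j)`, which is the identification used in the proof.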

The result follows from the following relations proven below:

1. Finite-Dimensional Approximation. As $j/\log n \to \infty$, then $\Delta_1 = \sup_{t\in T}\|S_n(t) - S_n \circ \pi_j(t)\| \to_P 0$.

2. Coupling with a Normal Vector. There exists $N_{nj} =_d N(0, \mathrm{var}[S_n \circ \pi_j])$ such that, if $p^5\xi_k^2/n \to 0$, then $\Delta_2 = \sup_j |N_{nj} - S_n \circ \pi_j| \to_P 0$.

3. Embedding a Normal Vector into a Gaussian Process. There exists a Gaussian process $Z_n$ with the properties stated in the lemma such that $N_{nj} = Z_n \circ \pi_j$ almost surely.

4. Infinite-Dimensional Approximation. If $j \to \infty$, then $\Delta_3 = \sup_{t\in T}|Z_n(t) - Z_n \circ \pi_j(t)| \to_P 0$.

We can select the sequence $j = \log^2 n$ such that the conditions on $j$ stated in relations (1)-(4) hold. We then conclude using the triangle inequality that

\[ \sup_{t\in T}|S_n(t) - Z_n(t)| \le \Delta_1 + \Delta_2 + \Delta_3 \to_P 0. \]

Relation 1 follows from

\[ \Delta_1 = \sup_{t\in T}|S_n(t) - S_n \circ \pi_j(t)| \le \sup_{\|t - t'\| \le j^{-1}}|S_n(t) - S_n(t')| \to_P 0, \]

where the last inequality holds by Lemma 5.

Relation 2 follows from the use of Yurinskii's coupling (Pollard (2002, page 244)): Let $\zeta_1, \ldots, \zeta_n$ be independent $p$-vectors with $E\zeta_i = 0$ for each $i$, and $\kappa := \sum_i E[\|\zeta_i\|^3]$ finite. Let $S = \zeta_1 + \cdots + \zeta_n$. For each $\delta > 0$ there exists a random vector $T$ with a $N(0, \mathrm{var}(S))$ distribution such that

\[ P\{\|S - T\| > 3\delta\} \le C_0 B\Big(1 + \frac{|\log(1/B)|}{p}\Big) \quad\text{where } B := \kappa p \delta^{-3}, \]

for some universal constant $C_0$.

In order to apply the coupling, we collapse S n o 7Tj to a p-vector, and we let 
Ci = Cm + - + C4 4 G M p , Ck = 7T G IT, 
where /i/j,/ = 1, ...,4 are defined in ()B.8p . so that S n o 7Tj = X)^=i Ci/V™- Now note that 
since E[||Ci|| 3 ] < maxi< z < 4 E[||0i|| 3 ] and 



W=i 

< p 3 / 2 supE|^(t fcj )| 3 E| 5i | 3 , 
teT 



3/2 



5 3/2 E (^Iy 
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where we use the independence of we have that 

n\\Q\\ 3 } < P 3/2 max su P E|^(i)| 3 E| 5i | 3 . 

Next we bound the right side of the display above for each I. First, for A(t) := 
q'T,E[z i p' i l{q'T,Zi > 0}}J^(a) 

su V E\h° u (t))\ 3 = su V E \ A(t) Pmi (a)\ 3 < sup \\A(t)f ■ sup E\5' Pi \ 3 sup E[|c^(a)| 3 |x; = x] 

teT teT teT ||<5||=1 aeA,xeX 

< sup E\5'pi\ 3 sup E[\ipi(a)\ 3 \xi = x] 
1 1 <5 1 1 = 1 aeA,xex 

< ik sup E\5'pi\ 2 sup E[\tpi(a)\ 3 \xi = x] < £ k , 

\\8\\=1 a£A,x£X 

where we used the assumption that sup ag _4 xeX E[|9Jj(a)| 3 |xj = x] < 1, ||Epij?i|| < 1, and 
that 

supP(t)||< sup [E[z[5\ 2 ] 1 ' 2 suplE^n^supllJ-^a)^!, 

teT \\6\\=1 ||<5||=1 aeA 

where the last bound is true by assumption. Similarly E|/i2j(i))| 3 < Next 
supE\h° 3i (t)\ 3 = supElg'S (a;^) £1 EE^ ) ^ s (a)] | 3 



teT 



teT 



< E\\ (xiz'j) || 3 sup || [E[ziWi tq ^(a)]\ 



teT 



< (E||(^;)|r + ||E(x^)|n (E||^|| 2 ) 3/2 supE |%(a) 



aG.4 



3/2 



< 1, 



(B.10) 



where the last bound follows from assumptions IC3I Finally, 

sup [E|/i4j(t)| 3 ] 1/3 = sup [E\q , T,z i w itq ^(a)\ 3 ] 1 3 + sup \Eq'T:ziW itq ^(a) 



teT 



teT 



teT 



< 2sup[E|</£z l | 6 ] 1/6 [E\w ws (a 

teT 

< [El^lY/SupEO^^o;)! 6 ] 1 /^!, 

aeA 



161 1/6 



where the last line follows from assumption 

Therefore, by Yurinskii's coupling, observing that in our case by the above arguments 
K = ^tj^xt-i for each 5 > if p 5 ^/n -»• 0, 



n 



<> -j 



< npp 3 / 2 Ck = p 5/2 tk n 
' ~ (5v^) 3 (5 3 nV2) u - 



This verifies relation (2). 

Relation (3) follows from the a.s. embedding of a finite-dimensional random normal vector into a path of a Gaussian process whose paths are continuous with respect to the standard metric $\rho_2$, defined in Lemma 6, which is shown e.g., in Belloni and Chernozhukov (2009). Moreover, since $\rho_2$ is continuous with respect to the Euclidean metric on $T$, as shown in part 2 of Lemma 6, the paths of the process are continuous with respect to the Euclidean metric as well.

Relation (4) follows from the inequality

\[ \Delta_3 = \sup_{t\in T}|Z_n(t) - Z_n \circ \pi_j(t)| \le \sup_{\|t - t'\| \le j^{-1}}|Z_n(t) - Z_n(t')| \lesssim_P (1/j)^c \log(1/j)^c \to 0, \]

where $0 < c < 1/2$ is defined in Lemma 6. This inequality follows from the entropy inequality for Gaussian processes (Corollary 2.2.8 of van der Vaart and Wellner (1996))

\[ E \sup_{\rho_2(t,t') \le \delta}|Z_n(t) - Z_n(t')| \lesssim \int_0^\delta \sqrt{\log N(\epsilon, T, \rho_2)}\,d\epsilon \]

and parts 2 and 3 of Lemma 6. From part 2 of Lemma 6 we first conclude that

\[ \log N(\epsilon, T, \rho_2) \lesssim \log(1/\epsilon), \]

and second that $\|t - t'\| \le 1/j$ implies $\rho_2(t,t') \le (1/j)^c$, so that

\[ E \sup_{\|t - t'\| \le 1/j}|Z_n(t) - Z_n(t')| \lesssim (1/j)^c \log(1/j)^c \quad\text{as } j \to \infty. \]

The claimed inequality then follows by Markov inequality. □
Lemma 5 (Bounded Oscillations). 1. (Sample) For $\epsilon_n = o((\log n)^{-1/(2c)})$, we have that

\[ \sup_{\|t - t'\| \le \epsilon_n}|\mathbb{G}_n[h(t) - h(t')]| \to_P 0. \]

2. (Bootstrap). For $\epsilon_n = o((\log n)^{-1/(2c)})$, we have that

\[ \sup_{\|t - t'\| \le \epsilon_n}|\mathbb{G}_n[(e_i - 1)(h^o(t) - h^o(t'))]| \to_P 0. \]

Proof. To show both statements, it will suffice to show that for either the case $g_i = 1$ or $g_i = e_i - 1$, we have that

\[ \sup_{\|t - t'\| \le \epsilon_n}|\mathbb{G}_n[g_i(h^o(t) - h^o(t'))]| \to_P 0. \]

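Step 1. Since

\[ \sup_{\|t - t'\| \le \epsilon_n}|\mathbb{G}_n[g_i(h^o(t) - h^o(t'))]| \lesssim \max_{1 \le \ell \le 4}\sup_{\|t - t'\| \le \epsilon_n}|\mathbb{G}_n[g_i(h^o_\ell(t) - h^o_\ell(t'))]|, \]

we bound the latter for each $\ell$. Using the results in Lemma 9 that bound the random entropy of $\mathcal{H}_1$ and $\mathcal{H}_2$ and the results in Lemma 11 we have that for $\ell = 1$ and $2$

\[ \Delta_{n\ell} = \sup_{\|t - t'\| \le \epsilon_n}|\mathbb{G}_n[g_i(h^o_\ell(t) - h^o_\ell(t'))]| \lesssim_P \sqrt{\log n}\sup_{\|t - t'\| \le \epsilon_n}\max_{\mathbb{P}\in\{P,\mathbb{P}_n\}}\|g_i(h^o_\ell(t) - h^o_\ell(t'))\|_{\mathbb{P},2}. \]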

By Lemma [9] that bounds the entropy of gi(Ji\ — 'HI) 2 and Lemma ITT1 we have that for 
£ = 1 or £ = 2, 



sup 

||i-i'||<e„ 



\/^psu P max \\g 2 hf(t)\\ v , 2 . 
n teT P6{P,P n } 
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By Step 2 below we have 



sup max \\g 2 hf (t)\\ P>2 < P k 2 maxF 4 max | 9i | 4 < P /£ 2 maxF 4 (logn) 4 , 
and by Lemma El ||g(/tg(i) — ^(i'))l|p,2 ^ ||* — t'\\ c • Putting the terms together we conclude 



A„f < P Tioi^ ^ + ^^£ fe ^maxF 4 (logn) 2 J -» P 0, 

by assumption and the choice of e n . 

For £ = 3 and £ = 4, by Lemma EJ ^(ftg - Wg) and - U%) are P-Donsker, so that 

A ne < sup |G n [ 5 (^(*)-^(*0)]l^pO. □ 

P2tt,t' n )<e° 

Step 2. Since || 5 2 /if (i)|| P)2 < 2|| 5 2 ^(t)|| Pi2 + 2\\g 2 E[h°(t)] 2 \\ ¥ , 2 , for F G {P,P n }, it 
suffices to bound each term separately. 

Uniformly in t G T for £ = 1,2 

® n [gh(t)] 4 < max,? 4 • ||E|| 4 ||E n [^l{g'E^ < 0}]|||| ^V)!! • sup E n [[5'pi]Vio(«)] 

11-511=1 

< P (log?i) 4 • 1 • e k \\®n\pip'i\\\ maxF 4 < P maxF 4 (logn) 4 , 

i<n i<n 

where we used assumptions IC3I and IC5I and the fact that ||E n [ziPil{q'T,Zi < 0}]|| <p 1 and 
E n \pip';] <p 1 as shown in the proof of Lemma [TJ 

Uniformly in t G T for £=1,2 

E[^(t)] 4 < Eb7 4 ]||E|| 4 ||E[« tf i i l{g'S^<0}]||||J5- 1 (a)l|- sup E[[^] 4 ^o(«)] 

ll*ll=i 

< P 1 ■ e k \\^\PiPi]\\ sup E[<pj Q (a)\x i =x]<Zl 



where we used assumption! 
Uniformly in t G T for £=1,2 

E n [g A E[h°M) A \ < E n g A E[hm} 4 <p 1 ■ V[hf (t)] 2 < 1, 

and 

E[/E[^(t)] 4 ] < E 5 4 E[/^(i)] 4 < 1 • E[hf(t)} 2 < 1. 
where the bound in E[h^ 2 (t)] 2 follows from calculations given in the proof of Lemma (6) □ 

Lemma 6 (Covariance Properties). 1. For some $0 < c < 1/2$

\[ \rho_2(h(t), h(t')) = \big(E[h(t) - h(t')]^2\big)^{1/2} \lesssim \rho(t, t') := \|t - t'\|^c. \]

2. The covariance function $E[h(t)h(t')] - E[h(t)]E[h(t')]$ is equi-continuous on $T \times T$ uniformly in $k$.






3. A sufficient condition for the variance function to be bounded away from zero, $\inf_{t\in T}\mathrm{var}(h(t)) \ge L > 0$, uniformly in $k$ is that the following matrices have minimal eigenvalues bounded away from zero uniformly in $k$: $\mathrm{var}([\varphi_{i1}(\alpha)\ \varphi_{i0}(\alpha)]' \mid x_i, z_i)$, $E[p_i p_i']$, $J_0^{-1}(\alpha)$, $J_1^{-1}(\alpha)$, $b_0'b_0$, and $b_1'b_1$, where $b_1 = E[z_i p_i' 1\{q'\Sigma z_i > 0\}]$ and $b_0 = E[z_i p_i' 1\{q'\Sigma z_i < 0\}]$.

Comment B.2. We emphasize that claim 3 only gives sufficient conditions for $\mathrm{var}(h(t))$ to be bounded away from zero. In particular, the assumption that

\[ \mathrm{mineig}\big(\mathrm{var}\big([\varphi_{i1}(\alpha)\ \varphi_{i0}(\alpha)]' \mid x_i, z_i\big)\big) \ge L \]

is not necessary, and does not hold in all relevant situations. For example, when the upper and lower bounds have first-order equivalent asymptotics, which can occur in the point-identified and local to point-identified cases, this condition fails. However, the result still follows from equation (B.11) under the assumption that

\[ \mathrm{var}(\varphi_{i1}(\alpha) \mid x_i, z_i) = \mathrm{var}(\varphi_{i0}(\alpha) \mid x_i, z_i) \ge L. \]



Proof. Claim 1. Observe that $\rho_2(h(t), h(\tilde t)) \lesssim \max_j \rho_2(h_j(t), h_j(\tilde t))$. We will bound each of these four terms. For the first term, we have



P2 (h 1 (t),h 1 (t)) =E 
<E 
+ E 



(/EE [z iP 'A {q'Zzi > 0}] Jf 1 (a) p mi (a) - 
-g'EE [zip'A {q'Zzi > 0}] Jf 1 (a) p mi (a 

((q - g)'SE [zip\\ {q'T,Zi > 0}] Jf 1 (a)pmi (a)) 



1/2 



1/2 



+ 



E [zip'A {q"Ezi > 0}] - 



1/2 



+ 



+ E 
+ E 

For the first term we have 



(g'EE [zip'A {q'Zzi > 0}] ( Jf 1 (a) - Jf 1 (a)) Pi(fn (a))' 
(g'EE [zip'A {qT.Zi > 0}] Jf 1 (a)pi {tpn (a) - <pn (a))) 2 



1/2 



+ 



1/2 



E 



((q - g)'EE [zip'A {g'Ezj > 0}] J x 1 (a) pupa (a))' 



1/2 



< 



< \\q-q\\ ll s ll || E [^p'i]|| \\ J i 1 ( a )\\ E [\\piPi\\ 2 } 1/2 supE[^( 



Ql) \Xj* Zi 



1/2 



By assumption lC3l ||E[zj^]||, 1 (a)|| , E[||pjj/J| 2 ], and sup x% z . E\pn{a) A \xi, Zi] are bounded 
uniformly in k and a. Therefore, 



E 



((q - g)'EE [zip'A {g'Ez; > 0}] J x 1 (a)pmi 



1/2 



^ II? -51 
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The same conditions give the following bound on the second term. 



E 



q s { -EMHfrzi > o}] 1 Jl {a)pmi {a) 

< ||E [1 {q'X Zi > 0} - 1 {fe > 0}] 



1/2 



< E 



1 {g'Ez; > 0} - 1 {q'Vzi > 0}) : 



1/2 



As in Step 5 of the proof of Lemma 1, the assumption that $P(|q'\Sigma z_i/\|z_i\|| < \delta)/\delta^m \lesssim 1$ implies

\[ E\Big[\big(1\{q'\Sigma z_i > 0\} - 1\{\tilde q'\Sigma z_i > 0\}\big)^2\Big]^{1/2} \lesssim \|q - \tilde q\|^{m/2}. \]



Similarly, the third term is bounded as follows: 



E 



(q'HE [Vil {fe > 0}] (Jf 1 (a) - Jf 1 (a)) (a)) : 



1/2 



< || Jf 1 (a)- Jf 1 ^)! 



Note that $J_1^{-1}(\alpha)$ is uniformly Lipschitz in $\alpha \in \mathcal{A}$ by Assumption C3, so $\|J_1^{-1}(\alpha) - J_1^{-1}(\tilde\alpha)\| \lesssim \|\alpha - \tilde\alpha\|$. Finally, the fourth term is bounded by



E 



(<?'£E [zfpjl {fe > 0}] Jf 1 (a) ^ (^i(a) 

<supE (</3a(a) - tpn(a)) \x i: Zi 
la — a 



Vii(a)))' 



1/2 



where we used the assumption that E 
continuous in a. Combining, we have 



4 1 1/2 

(<^ii(aO — (pn(a)) \xi,Z{ is uniformly 7^-Holder 



P2{hi(t)M(t))<h-q\\ + \\q-q\\ m/2 



+ \\a — all + \\a — a 



<\\t-t' 



|lAm/2A7 v 



An identical argument shows that $\rho_2(h_2(t), h_2(t')) \lesssim \|t - t'\|^{1 \wedge m/2 \wedge \gamma_\varphi}$.

The third and fourth components of $h(t)$ can be bounded using similar arguments. For $h_3(t)$, we have



< E 
+E 
+E 



p 2 (h 3 (t),h 3 (t)) = E (g'SxjZ-SE [ziU>i^z(a)] - <f Ea^EE [^^(a)] ) 2 

((5 - qO'Ex^-EE [ziWi^x(a)]) + 
(g'ExjZ-E (E [^^^^(a)] - E [ziW it ^(a)] )) 2 
(g'ExjZ-E (E [^^4,^2(0)] - E [^^i,g's(a)])) 2 



1/2 



1/2 



+ 



1/2 



< ||g-g||||E||E[^^max } ^(^,«) 2 ] 1/2 + 

+ ||E|| E z-zi (Q\(xi,a) - 6 (xi,a)) 2 \{\{q - q)'T,Zi\ > \q'T,Zi\} 



1/2 



+ 



+ IIEII E 



z'^i max (61 (xi, a) - 61 (xi, a)) 2 
fe{o,i} 



1/2 



By assumption, E ^[z-fi^Xi, a) 2 ] < ^E 
a. Also, 



E [O t (xi, 



a) 



1/2 



is bounded uniformly in 



E 



z'iZi (6»i (xu a) - 6 (xi,a)) 2 l{\q'T,Zi\/ < \\q - q\\} 



< 



< E 



1/2 



E [^a) 4 ] 172 +E [e (x u a) A Y /Z ) E [l {|</£^|/ |N| < || 9 - q 



41 1/2 



,1/2 



<ik-9ir /2 



where we have used the smoothness condition (C1) and the fact that $E[\|z_i\|^4] < \infty$ and $E[\theta_\ell(x_i, \alpha)^4] < \infty$ uniformly in $\alpha$.

By assumption, $\theta_\ell(x, \alpha)$ are Hölder continuous in $\alpha$ with coefficient $L(x)$, so







1/2 




E 


z^i max (6>£ (xj, a) - 9g (xi, a)) 2 

fe{0,l} 


<E 











1/2 



E [L(xi 



) 4 l 1/2 || a -fiP 



< lla - d|| 7e 



Thus, 



P2(Mi),M*))<lk-<7ll + lk-<? 

< Ik — ^|| 1Am /2A7e 



\ m/2 + \\a-a\V e 
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For h,4, we have 



P2 (h4(t),h&(E)) =E (q , Ez i w ii g^(a) - q l Y 1 z i w i ^ T ,(a))' 



<E 



((<? - q)' z>ZiWi, q >x(a)y 



1/2 



+ 



1/2 



1/2 



+ E (q'T,Zi (w ij(? 's(a) - w^q'T, + 
+ E (<f Szi (iOj i? / E (a) - Wj,,j'E(a))) 2 

<\\q-q\\ + \\q-q\\ m/2 + \\a-a\r 
by the exact same arguments used for $h_3$.

Claim 2. It suffices to show that $E[h_j(t)]$ for $j = 1, \ldots, 4$ and $E[h_j(t)h_k(t')]$ for $j = 1, \ldots, 4$ and $k = 1, \ldots, 4$ are uniformly equicontinuous. Hölder continuity implies equicontinuity, so we show that each of these functions is uniformly Hölder continuous.

Jensen's inequality and the result in Part 1 show that $E[h_j(t)]$ are uniformly Hölder:

\[ |E[h_j(t)] - E[h_j(t')]| \le E\big[(h_j(t) - h_j(t'))^2\big]^{1/2} \lesssim \|t - t'\|^c. \]

Given this, a simple calculation shows that $E[h_j(t_1)h_k(t_2)]$ are uniformly Hölder as well:

\[ \big|E[h_j(t_1)h_k(t_2) - h_j(t_1')h_k(t_2')]\big| = \big|E\big[(h_j(t_1) - h_j(t_1'))h_k(t_2) + h_j(t_1')(h_k(t_2) - h_k(t_2'))\big]\big| \]
\[ \le E\big[(h_j(t_1) - h_j(t_1'))^2\big]^{1/2}E\big[h_k(t_2)^2\big]^{1/2} + E\big[h_j(t_1')^2\big]^{1/2}E\big[(h_k(t_2) - h_k(t_2'))^2\big]^{1/2} \]
\[ \lesssim \|t_1 - t_1'\|^c \vee \|t_2 - t_2'\|^c. \]

Claim 3. By the law of total variance,

\[ \mathrm{var}(h(t)) = E[\mathrm{var}(h(t) \mid x_i, z_i)] + \mathrm{var}(E[h(t) \mid x_i, z_i]). \]



var (h(t)\xi,Zi) =vai(hi(t) + h 2 (t)\xi,Zi) 

_ VEE^l^E^ > 0}]J^ l (a) Pi 



1/2 



Zj, so 




V f 


<Pii(oi) 


var 1 


_¥>io(«)_ 



(B.ll) 



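Recall that $b_1 = E[z_i p_i' 1\{q'\Sigma z_i > 0\}]$ and $b_0 = E[z_i p_i' 1\{q'\Sigma z_i < 0\}]$. Let $\gamma_\ell = q'\Sigma b_\ell$, and $\mathrm{mineig}(M)$ denote the minimal eigenvalue of any matrix $M$. By assumption,

\[ \mathrm{mineig}\big(\mathrm{var}\big([\varphi_{i1}(\alpha)\ \varphi_{i0}(\alpha)]' \mid x_i, z_i\big)\big) \ge L, \]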
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so 



E [var (h(t)\xi, z { )] >E || [7^ 1 (a)p i 70 J 1 (a)p i ] || 2 

1 71 Jf 1 (a)p i || 2 V j^o^o" 1 ^)^ 



>E 



>E 



|ti^i 1 (a)pi\ 



V E 



I To^o 1 ( a )Pi\ 



Repeated use of the inequality ||£2/|| 2 > mineig (yy') ||x|| 2 yields for I = 0, 1, 
\liJ^ l {a)pi\^ > mineig (E [piPi]) mineig (J £ _1 (a)) 2 ||t^|| 



E 



> 



mineig (b'gb^) Ik'Sl 



> 

where the last line follows from the fact that $b_\ell'b_\ell$ is a scalar. We now show that $b_\ell'b_\ell > 0$. Let $f_{1i} = z_i 1\{q'\Sigma z_i > 0\}$ and $f_{0i} = z_i 1\{q'\Sigma z_i < 0\}$. Observe that $z_i = f_{1i} + f_{0i}$ and $E[f_{1i}'f_{0i}] = 0$, so

\[ E[f_{1i}'f_{1i}] \vee E[f_{0i}'f_{0i}] \ge \tfrac{1}{2}E[z_i'z_i] > 0. \]

By the completeness of our series functions, we can represent $f_{1i}$ and $f_{0i}$ in terms of the series functions. Let

\[ f_{1i} = \sum_{j=1}^\infty c_{1j}p_{ji}, \qquad f_{0i} = \sum_{j=1}^\infty c_{0j}p_{ji}. \]

Without loss of generality, assume the series functions are orthonormal. Then

\[ E[f_{1i}'f_{1i}] = \sum_{j=1}^\infty c_{1j}^2, \qquad E[f_{0i}'f_{0i}] = \sum_{j=1}^\infty c_{0j}^2. \]



Also, 



Thus, 



j'=i 

E [var (h(t)\xi,Zi)] > mineig(6 / 1 6i) V mineig(&o& ) > 



□ 



B.4. Conservative Inference with Discrete Covariates. Let $\Theta(x, \alpha) = [\theta_0(x, \alpha), \theta_1(x, \alpha)]$ and to simplify notation suppress the dependence of $\Theta$ and $\theta_\ell$ on $(x, \alpha)$ and let the instruments coincide with $x = [x_1\ x_2]'$, with $x_1 = 1$ and $x_2 \in \mathbb{R}^{d-1}$. Let $\Sigma = E(xx')^{-1}$, $z = x + \sigma[0\ \eta]'$, with $\eta \sim N(0, I)$ and independent of $x$ and $\theta_\ell$, $\ell = 0, 1$, where $I$ denotes the identity matrix. Note that $E(xx') = E(zx')$, and define

\[ B = \Sigma\mathbf{E}(x\Theta), \qquad \tilde B = \Sigma\mathbf{E}(z\Theta), \]

where $\mathbf{E}(\cdot)$ denotes the Aumann expectation of the random set in parenthesis, see Molchanov (2005, Chapter 2). Denote by $\hat B$ the estimator of $\tilde B$ (the unique convex set corresponding to the estimated support function) and by $B_{\hat c_n(1-\tau)}$ a ball in $\mathbb{R}^d$ centered at zero and with radius $\hat c_n(1-\tau)$, with $\hat c_n(1-\tau)$ the Bayesian bootstrap estimate of the $1 - \tau$ quantile of $f(\mathbb{G}[h_k(t)])$, with $f(s(t)) = \sup_{t\in T}\{-s(t)\}_+$, see Section 4.3. Following arguments in Beresteanu and Molinari (2008, Section 2.3), one can construct a (convex) confidence set $CS_n$ such that $\sup_{\alpha\in\mathcal{A}}(\sigma_{\hat B}(q, \alpha) - \sigma_{CS_n}(q)) = \hat c_n(1-\tau)$ for all $q \in S^{d-1}$, where $\sigma_A(\cdot)$ denotes the support function of the set $A$. It then follows that

\[ \lim_{n\to\infty} P\Big(\sup_{\alpha\in\mathcal{A},\,q\in S^{d-1}}\big|\sigma_{\tilde B}(q, \alpha) - \sigma_{CS_n}(q)\big|_+ = 0\Big) = 1 - \tau. \]

Lemma 7. For a given $\delta > 0$, one can jitter $x$ via $z = x + \sigma_\delta[0\ \eta]'$, so as to obtain a set $\tilde B$ such that $\sup_{\alpha\in\mathcal{A}}\rho_H(B, \tilde B) \le \delta$ and



l-r- 7 («5) > lim P sup \a B (q, a) - {a CS M + S)\ + = > 1 - r, (B.12) 



w/iere 7 (5) = P (su PteT {— G [/i fc (i)]} + > 25) 



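Proof. Observe that $\rho_H(\tilde B, B) = \rho_H(\Sigma\mathbf{E}(z\Theta), \Sigma\mathbf{E}(x\Theta))$. By the properties of the Aumann expectation (see, e.g., Molchanov (2005, Theorem 2.1.17)),

\[ \rho_H(\Sigma\mathbf{E}(z\Theta), \Sigma\mathbf{E}(x\Theta)) \le E[\rho_H(\Sigma(z\Theta), \Sigma(x\Theta))]. \]

In turn,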

supE[p H (E(2G),E(x6))] 



sup E 



sup 

v=T,'q:\\v\\=l 



SUp (vi + 2 2 « 2 ) 8 - SUp (t>i + X2V2) < 

6»ee 6»ee 



sup E 

a£A 



sup 

v=X'q:\\v\\=l 



(vi + x 2 v 2 + crnn 2 ) (6> 1 («i + x 2 t> 2 + 0"nn 2 < 0) + 6"i 1 («i + x 2 i> 2 + o"ni; 2 > 0)) 
(i>i + x 2 v 2 ) (6> 1 (vi + x 2 t> 2 < 0) + f?i 1 («i + x 2 -y 2 > 0)) 



< sup E 

aeA 



sup \arjV2 {0 l (v\ + x 2 f 2 + o"ni; 2 < 0) + 6\1 (vi + x 2 u 2 + ai]v 2 > 0))| 

ti=S'q:||n||=l 



+ sup E 

aeA 



sup |(«i + x 2 v 2 ) (9 1 - 60) (1 (0 < - (v\ + X2V2) < or\V2) — 1 (0 < V\ + x 2 t> 2 < — crnv 2 ))| 

v=Y.' q:\\v\\=l 



< crE |r/| ( sup E \0q(x, a)\ + sup E \9\{x, a)\ + sup E \0\(x, a) — 0q(x, a)\ J . 

a6»4 



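Hence, we can choose $\sigma_\delta$ as

\[ \sigma_\delta = \frac{\delta}{E|\eta|\big(\sup_{\alpha\in\mathcal{A}}E|\theta_0(x, \alpha)| + \sup_{\alpha\in\mathcal{A}}E|\theta_1(x, \alpha)| + \sup_{\alpha\in\mathcal{A}}E|\theta_1(x, \alpha) - \theta_0(x, \alpha)|\big)}. \]

Now observe that because $\sup_{\alpha\in\mathcal{A}}\rho_H(B, \tilde B) \le \delta$, we have $B(\alpha) \subseteq \tilde B(\alpha) \oplus B_\delta$ for all $\alpha \in \mathcal{A}$, where "$\oplus$" denotes Minkowski set summation, and therefore

\[ \sup_{\alpha\in\mathcal{A}}\big(\sigma_{\tilde B}(q, \alpha) - \sigma_{CS_n}(q)\big) \le 0\ \ \forall q \in S^{d-1} \ \Longrightarrow\ \sup_{\alpha\in\mathcal{A}}\big(\sigma_B(q, \alpha) - (\sigma_{CS_n}(q) + \delta)\big) \le 0\ \ \forall q \in S^{d-1}, \]

from which the second inequality in (B.12) follows. Notice also that $\tilde B(\alpha) \subseteq B(\alpha) \oplus B_\delta$ for all $\alpha \in \mathcal{A}$, and therefore

\[ \sup_{\alpha\in\mathcal{A}}\big(\sigma_B(q, \alpha) - (\sigma_{CS_n}(q) + \delta)\big) \le 0\ \ \forall q \in S^{d-1} \ \Longrightarrow\ \sup_{\alpha\in\mathcal{A}}\big(\sigma_{\tilde B}(q, \alpha) - (\sigma_{CS_n}(q) + 2\delta)\big) \le 0\ \ \forall q \in S^{d-1}, \]

from which the first inequality in (B.12) follows. Because $\delta > 0$ is chosen by the researcher, inference is arbitrarily slightly conservative. Note that a similar argument applies if one uses a Kolmogorov statistic rather than a directed Kolmogorov statistic. Moreover, the Hausdorff distance between convex compact sets is larger than the $L^p$ distance between them (see, e.g., Vitale (1985, Theorem 1)), and therefore a similar conclusion applies for Cramer-Von-Mises statistics. □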
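The jittering device can be illustrated with a short simulation. The Python sketch below is only an illustration under assumed toy choices that are not from the paper: a single binary covariate $x_2 \in \{0, 1\}$, $x_1 = 1$, and a hypothetical jitter scale `sigma`. It checks the two features the argument relies on: the sample analogues of $E(xx')$ and $E(zx')$ nearly coincide, while the jittered regressor no longer has a discrete distribution.

```python
import random


def jitter_moments(n=20000, sigma=0.1, seed=0):
    """Return (max entrywise gap between E_n[xx'] and E_n[zx'],
    number of distinct values of the jittered regressor z_2)."""
    rng = random.Random(seed)
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    szx = [[0.0, 0.0], [0.0, 0.0]]
    z2_values = set()
    for _ in range(n):
        x = (1.0, float(rng.randint(0, 1)))   # x = [1, x_2]' with binary x_2
        eta = rng.gauss(0.0, 1.0)             # jitter noise, independent of x
        z = (x[0], x[1] + sigma * eta)        # z = x + sigma * [0, eta]'
        z2_values.add(z[1])
        for r in range(2):
            for c in range(2):
                sxx[r][c] += x[r] * x[c] / n
                szx[r][c] += z[r] * x[c] / n
    gap = max(abs(sxx[r][c] - szx[r][c]) for r in range(2) for c in range(2))
    return gap, len(z2_values)


if __name__ == "__main__":
    gap, distinct = jitter_moments()
    print("max |E_n[xx'] - E_n[zx']| entry:", gap)
    print("distinct values of z_2:", distinct)
```

Shrinking `sigma` moves the jittered problem closer to the original one, which is the trade-off that the choice of $\delta$ governs in Lemma 7.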

B.5. Lemmas on Entropy Bounds. We collect frequently used facts in the following 
lemma. 

Lemma 8. Let $Q$ be any probability measure whose support concentrates on a finite set.

(1) Let $\mathcal{F}$ be a measurable VC class with a finite VC index $k$ or any other class whose entropy is bounded above by that of such a VC class, then its entropy obeys

\[ \log N(\epsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \lesssim 1 + k\log(1/\epsilon). \]

Examples include e.g., linear functions $\mathcal{F} = \{\alpha'w_i, \alpha \in \mathbb{R}^k, \|\alpha\| \le C\}$ and their indicators $\mathcal{F} = \{1\{\alpha'w_i > 0\}, \alpha \in \mathbb{R}^k, \|\alpha\| \le C\}$.

(2) Entropies obey the following rules for sets created by addition, multiplication, and unions of measurable function sets $\mathcal{F}$ and $\mathcal{F}'$:

\[ \log N(\epsilon\|F + F'\|_{Q,2}, \mathcal{F} + \mathcal{F}', L_2(Q)) \le B, \]
\[ \log N(\epsilon\|F \cdot F'\|_{Q,2}, \mathcal{F} \cdot \mathcal{F}', L_2(Q)) \le B, \]
\[ \log N(\epsilon\|F \vee F'\|_{Q,2}, \mathcal{F} \cup \mathcal{F}', L_2(Q)) \le B, \]
\[ B = \log N\big(\tfrac{\epsilon}{2}\|F\|_{Q,2}, \mathcal{F}, L_2(Q)\big) + \log N\big(\tfrac{\epsilon}{2}\|F'\|_{Q,2}, \mathcal{F}', L_2(Q)\big). \]

(3) Entropies are preserved by multiplying a measurable function class $\mathcal{F}$ with a random variable $g_i$:

\[ \log N(\epsilon\||g|F\|_{Q,2}, g\mathcal{F}, L_2(Q)) \lesssim \log N(\epsilon/2\,\|F\|_{Q,2}, \mathcal{F}, L_2(Q)). \]

(4) Entropies are preserved by integration or taking expectation: for $f^*(x) := \int f(x, y)\,d\mu(y)$ where $\mu$ is some probability measure,

\[ \log N(\epsilon\|F\|_{Q,2}, \mathcal{F}^*, L_2(Q)) \le \log N(\epsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)). \]

Proof. For the proof of (1)-(3) see e.g., Andrews (1994). For the proof of (4), see e.g., Ghosal, Sen, and van der Vaart (2000, Lemma A2). □

Next consider function classes and their envelopes

U x = {q h £E[z iP ' i l{q'Xz i <Q}]J G \a) Pi ^{a),t G T}, H x < 
U 2 = {q'T,E[zi Pi l{q'T,Zi > Q}]J^ x (a)pnpii(a),t G T}, H 2 < 
rl 3 = {q'T,XizlEE[ziW itql ^(a)] ,t eT}, F 3 < ||a;»||||^|| 
rL A = {q'Y,ZiW iiq 'x(a),t G T}, H 4 < \\zi\\F 2 

^3 = {/i'r 1 (o) Wi (a) ) ae4 < efc-^i, (B.13) 
where /i' is defined in equation ()B.9j) . 

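Lemma 9. 1. (a) The following bounds on the empirical entropy apply

\[ \log N(\epsilon\|H_1\|_{\mathbb{P}_n,2}, \mathcal{H}_1, L_2(\mathbb{P}_n)) \lesssim_P \log n + \log(1/\epsilon), \]
\[ \log N(\epsilon\|H_2\|_{\mathbb{P}_n,2}, \mathcal{H}_2, L_2(\mathbb{P}_n)) \lesssim_P \log n + \log(1/\epsilon), \]
\[ \log N(\epsilon\|F_3\|_{\mathbb{P}_n,2}, \mathcal{F}_3, L_2(\mathbb{P}_n)) \lesssim_P \log n + \log(1/\epsilon). \]

(b) Moreover similar bounds apply to function classes $g_i(\mathcal{H}_\ell^o - \mathcal{H}_\ell^o)$ with the envelopes given by $|g_i|4H_\ell$, where $g_i$ is a random variable.

2. (a) The following bounds on the uniform entropy apply

\[ \sup_Q \log N(\epsilon\|H_1\|_{Q,2}, \mathcal{H}_1, L_2(Q)) \lesssim k\log(1/\epsilon), \]
\[ \sup_Q \log N(\epsilon\|H_2\|_{Q,2}, \mathcal{H}_2, L_2(Q)) \lesssim k\log(1/\epsilon), \]
\[ \sup_Q \log N(\epsilon\|F_3\|_{Q,2}, \mathcal{F}_3, L_2(Q)) \lesssim k\log(1/\epsilon), \]
\[ \sup_Q \log N(\epsilon\|H_3\|_{Q,2}, \mathcal{H}_3, L_2(Q)) \lesssim \log(1/\epsilon), \]
\[ \sup_Q \log N(\epsilon\|H_4\|_{Q,2}, \mathcal{H}_4, L_2(Q)) \lesssim \log(1/\epsilon). \]

(b) Moreover similar bounds apply to function classes $g_i(\mathcal{H}_\ell^o - \mathcal{H}_\ell^o)$ with the envelopes given by $|g_i|4H_\ell$, where $g_i$ is a random variable.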

Proof. Part 1 (a). Case of $\mathcal{H}_1$ and $\mathcal{H}_2$. We shall detail the proof for this case, while providing shorter arguments for others, as they are simpler or similar.

Note that $\mathcal{H}_1 \subset \mathcal{M}_1 \cdot \mathcal{M}_2 \cdot \mathcal{F}_1$, where $\mathcal{M}_1 = \{q'\Sigma z_i, q \in S^{d-1}\}$ with envelope $M_1 = \|z_i\|$ is VC with index $\dim(z_i) + \dim(x_i)$, and $\mathcal{M}_2 = \{\gamma(q)J_0^{-1}(\alpha)p_i, (q, \alpha) \in S^{d-1} \times \mathcal{A}\}$ with envelope $M_2 \lesssim \|\xi_k\|$, $\mathcal{F}_1 = \{\varphi_{i0}(\alpha), \alpha \in \mathcal{A}\}$ with envelope $F_1$, where $\gamma(q)$ is uniformly
$\log L_{1n} \lesssim_P \log n$ and $\log L_{2n} \lesssim_P 1$.



Note that $\log\xi_k \lesssim \log n$ by assumption, $\sup_{\alpha\in\mathcal{A}}\|J^{-1}(\alpha)\| \lesssim 1$ by assumption, $\|\mathbb{E}_n[p_i p_i']\| \lesssim_P 1$ by Lemma 10. The sets $S^{d-1}$ and $\mathcal{A}$ are compact subsets of Euclidean space of fixed dimension, and so can be covered by a constant times $1/\epsilon^c$ balls of radius $\epsilon$ for some constant $c > 0$. Therefore, we can conclude

\[ \log N(\epsilon\|M_2\|_{\mathbb{P}_n,2}, \mathcal{M}_2, L_2(\mathbb{P}_n)) \lesssim_P \log n + \log(1/\epsilon). \]

Repeated application of Lemma 8 yields the conclusion, given the assumption on the function class $\mathcal{F}_1$. The case for $\mathcal{H}_2$ is very similar.

Case of $\mathcal{F}_3$. Note that $\mathcal{F}_3 \subset \mathcal{M}_2 \cdot \mathcal{F}_1$ and $\|\hat\mu\| = o_P(1)$ by Step 4 in the proof of Lemma 1. Repeated application of Lemma 8 yields the conclusion, given the assumption on the function class $\mathcal{F}_1$.

Part 1 (b). Note that $\mathcal{H}^o = \mathcal{H} - E[\mathcal{H}]$, so it is created by integration and summation. Hence repeated application of Lemma 8 yields the conclusion.

Part 2. (a) Case of $\mathcal{H}_1$, $\mathcal{H}_2$, and $\mathcal{F}_3$. Note that all of these classes are subsets of $\{\mu'p_i, \|\mu\| \le C\} \cdot \mathcal{F}_1$ with envelope $\xi_k F_1$. The claim follows from repeated application of Lemma 8.

Case of $\mathcal{H}_3$. Note that $\mathcal{H}_3 \subset \{q'\Sigma x_i z_i'\mu, \|\mu\| \le C\}$ with envelope $\|x_i\|\|z_i\|$. The claim follows from repeated application of Lemma 8.

Case of $\mathcal{H}_4$. Note that $\mathcal{H}_4$ is a subset of a function class created from taking the class $\mathcal{F}_2$, multiplying it with the indicator function class $\{1\{q'\Sigma z_i > 0\}, q \in S^{d-1}\}$ and with the function class $\{q'\Sigma z_i, q \in S^{d-1}\}$, and then adding the resulting class to itself. The claim follows from repeated application of Lemma 8.

Part 2 (b). Note that $\mathcal{H}^o = \mathcal{H} - E[\mathcal{H}]$, so it is created by integration and summation. Hence repeated application of Lemma 8 yields the conclusion. □

B.6. Auxiliary Maximal and Random Matrix Inequalities. We repeatedly use the following matrix LLN.

Lemma 10 (Matrix LLN). Let $Q_1, \ldots, Q_n$ be i.i.d. symmetric non-negative matrices such that $Q = EQ_i$ and $\|Q_i\| \le M$, then for $\hat Q = \mathbb{E}_n Q_i$

\[ E\|\hat Q - Q\| \lesssim \sqrt{\frac{M(1 + \|Q\|)\log k}{n}}. \]

In particular, if $Q_i = p_i p_i'$, with $\|p_i\| \le \xi_k$, then

\[ E\|\hat Q - Q\| \lesssim \sqrt{\frac{\xi_k^2(1 + \|Q\|)\log k}{n}}. \]
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Proof. This is a variant of a result from Rudelson (1999). By the symmetrization lemma, 



A := E 



Q-Q 



<2EE e ||E n [e l Q l 



where $\epsilon_i$ are Rademacher random variables. The Khintchine inequality for matrices, which was shown by Rudelson (1999) to follow from the results of Lust-Piquard and Pisier (1991), states that
l2i\l/2 



log k 



ir 



f MQt]y 



Since (remember that ||-|| is the operator norm) 



E 

and 

one has 

Solving for A gives 



(En[Q?]) 



2l\l/2 



E||(E n [Q? 



1 1/2 



< [ME ||E n Qi 



,1/2 



|EnQi|| < A+IIQH 



n 



A < 



'4M||Q||logfc / Mlogk 



+ 



n 



n 



+ 



M log k 



n 



□ 



which implies the result stated in the lemma if $M\log k/n < 1$.
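As a rough numerical companion to the matrix LLN, the Python sketch below simulates the Gram matrix $\mathbb{E}_n[p_i p_i']$ for a hypothetical two-term orthonormal basis on $[0, 1]$, namely $p(u) = (1, \sqrt{3}(2u - 1))'$ with $u$ uniform, so that $Q = E[p_i p_i'] = I$ and $\|p_i\|^2 \le 4$. It reports the Frobenius norm of $\hat Q - Q$, which is an upper bound on the operator norm used in the lemma. The basis, sample sizes, and function names are assumptions made only for this illustration.

```python
import math
import random


def basis(u):
    """Hypothetical orthonormal basis on [0, 1]: constant and scaled linear term."""
    return (1.0, math.sqrt(3.0) * (2.0 * u - 1.0))


def gram_deviation(n, seed=0):
    """Frobenius norm of E_n[p p'] - I, an upper bound on the operator norm."""
    rng = random.Random(seed)
    q_hat = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n):
        p = basis(rng.random())
        for r in range(2):
            for c in range(2):
                q_hat[r][c] += p[r] * p[c] / n
    return math.sqrt(sum((q_hat[r][c] - (1.0 if r == c else 0.0)) ** 2
                         for r in range(2) for c in range(2)))


if __name__ == "__main__":
    for n in (200, 2000, 20000):
        print(n, gram_deviation(n))
```

Running the loop for increasing `n` gives a feel for how quickly the empirical Gram matrix settles near its population counterpart in this simple design.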
We also use the following maximal inequality. 

Lemma 11. Consider a separable empirical process $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - E[f(Z_i)]\}$, where $Z_1, \ldots, Z_n$ is an underlying independent data sequence on the space $(\Omega, \mathcal{G}, P)$, defined over the function class $\mathcal{F}$, with an envelope function $F \ge 1$ such that $\log[\max_{i\le n}\|F\|] \lesssim_P \log n$ and

\[ \log N(\epsilon\|F\|_{\mathbb{P}_n,2}, \mathcal{F}, L_2(\mathbb{P}_n)) \le \upsilon m\log(\kappa/\epsilon), \quad 0 < \epsilon < 1, \]

with some constants $0 < \log\kappa \lesssim \log n$, $m$ potentially depending on $n$, and $1 < \upsilon \lesssim 1$. For any $\delta \in (0, 1)$, there is a large enough constant $K_\delta$, such that for $n$ sufficiently large,

\[ P\Big\{\sup_{f\in\mathcal{F}}|\mathbb{G}_n(f)| \le K_\delta\sqrt{m\log n}\,\max\Big\{\sup_{f\in\mathcal{F}}\|f\|_{P,2},\ \sup_{f\in\mathcal{F}}\|f\|_{\mathbb{P}_n,2}\Big\}\Big\} \ge 1 - \delta. \]

Proof. TO BE ADDED. This is a restatement of Lemma 19 from Belloni and Chernozhukov (2009b). □
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